
Showing papers on "Parallel algorithm published in 2013"


Journal ArticleDOI
TL;DR: A novel parallel intelligent algorithm, full-connection-based parallel adaptive chaos optimization with reflex migration (FC-PACO-RM), is developed and demonstrated to be effective for addressing complex SCOS in CMfg.
Abstract: In order to realize the full-scale sharing, free circulation and transaction, and on-demand use of manufacturing resources and capabilities in modern enterprise systems (ES), Cloud manufacturing (CMfg) has recently been proposed as a new service-oriented manufacturing paradigm. Compared with cloud computing, the services managed in CMfg include not only computational and software resource and capability services, but also various manufacturing resource and capability services. These dynamic services make ES more powerful and turn them into a higher-level extension of traditional services. Thus, as a key issue for the implementation of CMfg-based ES, service composition optimal-selection (SCOS) is becoming very important. SCOS is a typical NP-hard problem characterized by dynamism and uncertainty, and solving large-scale SCOS problems with numerous constraints in CMfg using traditional methods can be inefficient. To overcome this shortcoming, the formulation of SCOS in CMfg with multiple objectives and constraints is investigated first, and then a novel parallel intelligent algorithm, full-connection-based parallel adaptive chaos optimization with reflex migration (FC-PACO-RM), is developed. In the algorithm, roulette wheel selection and adaptive chaos optimization drive the search, while full-connection parallelization in an island model and a new reflex migration scheme are developed for efficient decision making. To validate the performance of FC-PACO-RM, comparisons with 3 serial algorithms and 7 typical parallel methods are conducted on three typical cases. The results demonstrate the effectiveness of the proposed method for addressing complex SCOS in CMfg.
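The chaos-optimization component can be illustrated with a minimal serial sketch. This is not the authors' FC-PACO-RM (no parallel islands, migration, or roulette wheel selection); the objective function, logistic-map seed, and shrinking-radius schedule are illustrative assumptions:

```python
# Minimal chaos-optimization sketch: a logistic map generates ergodic
# candidate points, and the search window adaptively shrinks around the
# best point found so far (the "adaptive" part; illustrative only).

def chaos_minimize(f, lo, hi, coarse_iters=2000, refine_rounds=5, refine_iters=300):
    z = 0.7  # chaotic state in (0, 1), chosen away from fixed points of the map
    best_x, best_f = None, float("inf")

    def sample(a, b, iters):
        nonlocal z, best_x, best_f
        for _ in range(iters):
            z = 4.0 * z * (1.0 - z)      # logistic map, fully chaotic at r = 4
            x = a + (b - a) * z          # map chaotic state into [a, b]
            fx = f(x)
            if fx < best_f:
                best_x, best_f = x, fx

    sample(lo, hi, coarse_iters)          # global chaotic search
    radius = (hi - lo) / 10.0
    for _ in range(refine_rounds):        # adaptive refinement around the best point
        a = max(lo, best_x - radius)
        b = min(hi, best_x + radius)
        sample(a, b, refine_iters)
        radius /= 10.0
    return best_x, best_f

x, fx = chaos_minimize(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
```

The chaotic sequence plays the role a pseudo-random generator would in plain random search; its ergodicity is what lets the coarse phase cover the whole domain.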

331 citations


Book
10 Nov 2013
TL;DR: A monograph on global optimization treating algorithms as decision procedures, covering the theoretical background and core univariate case, generalizations for parallel computing and constrained/multiple-criteria problems, and multidimensional methods based on Peano-type space-filling curves.
Abstract: Preface. Acknowledgements.
Part One: Global Optimization Algorithms as Decision Procedures. Theoretical Background and Core Univariate Case. 1. Introduction. 2. Global Optimization Algorithms as Statistical Decision Procedures - The Information Approach. 3. Core Global Search Algorithm and Convergence Study. 4. Global Optimization Methods as Bounding Procedures - The Geometric Approach.
Part Two: Generalizations for Parallel Computing, Constrained and Multiple Criteria Problems. 5. Parallel Global Optimization Algorithms and Evaluation of the Efficiency of Parallelism. 6. Global Optimization under Non-Convex Constraints - The Index Approach. 7. Algorithms for Multiple Criteria Multiextremal Problems.
Part Three: Global Optimization in Many Dimensions. Generalizations through Peano Curves. 8. Peano-Type Space-Filling Curves as Means for Multivariate Problems. 9. Multidimensional Parallel Algorithms. 10. Multiple Peano Scannings and Multidimensional Problems.
References. List of Algorithms. List of Figures. List of Tables. Index.

284 citations


Journal ArticleDOI
TL;DR: A new algorithm is presented for solving the basic stream power equation, which governs channel incision and landscape evolution in many geomorphic settings; the algorithm is highly efficient and unconditionally stable.

227 citations


Posted Content
Nitish Korula1, Silvio Lattanzi1
TL;DR: A small fraction of individuals explicitly link their accounts across multiple online social networks (e.g., Facebook, Twitter, Google+, LinkedIn); this work leverages those connections, via a simple, local, and efficient parallel algorithm, to identify a very large fraction of the remaining matching accounts.
Abstract: People today typically use multiple online social networks (Facebook, Twitter, Google+, LinkedIn, etc.). Each online network represents a subset of their "real" ego-networks. An interesting and challenging problem is to reconcile these online networks, that is, to identify all the accounts belonging to the same individual. Besides providing a richer understanding of social dynamics, the problem has a number of practical applications. At first sight, this problem appears algorithmically challenging. Fortunately, a small fraction of individuals explicitly link their accounts across multiple networks; our work leverages these connections to identify a very large fraction of the network. Our main contributions are to mathematically formalize the problem for the first time, and to design a simple, local, and efficient parallel algorithm to solve it. We are able to prove strong theoretical guarantees on the algorithm's performance on well-established network models (Random Graphs, Preferential Attachment). We also experimentally confirm the effectiveness of the algorithm on synthetic and real social network data sets.
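The core idea — propagate outward from explicitly linked seed accounts by matching users who share enough already-matched neighbors — can be sketched serially. This is a toy version, not the authors' exact algorithm; the threshold of two shared matched neighbors and the unique-winner rule are illustrative assumptions:

```python
# Seed-based network reconciliation sketch: repeatedly match an account pair
# (u in g1, v in g2) when they share at least `threshold` already-matched
# neighbors and v is the unique best-scoring candidate for u.

def reconcile(g1, g2, seeds, threshold=2):
    mapping = dict(seeds)                     # known g1 -> g2 account links
    matched2 = set(mapping.values())
    changed = True
    while changed:
        changed = False
        for u in g1:
            if u in mapping:
                continue
            # score each unmatched g2 account by shared matched neighbors
            scores = {}
            for n in g1[u]:
                if n in mapping:
                    for v in g2:
                        if v not in matched2 and mapping[n] in g2[v]:
                            scores[v] = scores.get(v, 0) + 1
            if not scores:
                continue
            best = max(scores.values())
            winners = [v for v, s in scores.items() if s == best]
            if best >= threshold and len(winners) == 1:
                mapping[u] = winners[0]
                matched2.add(winners[0])
                changed = True
    return mapping

# Two copies of the same ego-network under different account names;
# two seed links suffice to recover the rest.
g1 = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"}, "d": {"b", "c"}}
g2 = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
mapping = reconcile(g1, g2, {"a": 1, "b": 2})
```

Because each match only inspects local neighborhoods, rounds of this loop map naturally onto a parallel, per-node computation.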

189 citations


Proceedings ArticleDOI
12 Oct 2013
TL;DR: A fast parallel SGD method, FPSGD, for shared memory systems is developed by dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, which is more efficient than state-of-the-art parallel algorithms for matrix factorization.
Abstract: Matrix factorization is known to be an effective method for recommender systems that are given only the ratings from users to items. Currently, stochastic gradient descent (SGD) is one of the most popular algorithms for matrix factorization. However, as a sequential approach, SGD is difficult to parallelize for handling web-scale problems. In this paper, we develop a fast parallel SGD method, FPSGD, for shared-memory systems. By dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, FPSGD is more efficient than state-of-the-art parallel algorithms for matrix factorization.
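The underlying serial SGD update that FPSGD parallelizes can be sketched as follows. This is a minimal toy version; the learning rate, rank, and synthetic ratings are illustrative assumptions, and none of FPSGD's block partitioning, cache optimization, or thread scheduling is shown:

```python
import random

# Minimal serial SGD for matrix factorization: learn P (users x k) and
# Q (items x k) so that the dot product P[u] . Q[i] approximates rating r_ui.

def sgd_mf(ratings, n_users, n_items, k=2, lr=0.05, epochs=500, seed=0):
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            e = r - pred                      # prediction error for this rating
            for f in range(k):                # gradient step on both factors
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * e * qi
                Q[i][f] += lr * e * pu
    return P, Q

# Synthetic rank-2 ratings, so an exact fit exists.
true_P = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]
true_Q = [[0.9, 0.1], [0.2, 1.0], [0.6, 0.4]]
ratings = {(u, i): sum(true_P[u][f] * true_Q[i][f] for f in range(2))
           for u in range(3) for i in range(3)}
P, Q = sgd_mf(ratings, 3, 3)
mse = sum((r - sum(P[u][f] * Q[i][f] for f in range(2))) ** 2
          for (u, i), r in ratings.items()) / len(ratings)
```

The parallelization challenge the paper addresses is visible here: two threads updating ratings that share a user row or item column would race on P[u] or Q[i], which is why FPSGD partitions the rating matrix into disjoint blocks.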

176 citations


MonographDOI
05 Oct 2013
TL;DR: This book provides a seamless approach to numerical algorithms, modern programming techniques and parallel computing and places equal emphasis on the discretization of partial differential equations and on solvers.
Abstract: This book provides a seamless approach to numerical algorithms, modern programming techniques and parallel computing. These concepts and tools are usually taught serially across different courses and different textbooks, thus obscuring the connection between them. The necessity of integrating these subjects usually comes after such courses are concluded (e.g., during a first job or a thesis project), thus forcing the student to synthesize what is perceived to be three independent subfields into one in order to produce a solution. The book includes both basic and advanced topics and places equal emphasis on the discretization of partial differential equations and on solvers. Advanced topics include wavelets, high-order methods, non-symmetric systems and parallelization of sparse systems. A CD-ROM accompanies the text.

166 citations


Book
29 Jun 2013
TL;DR: This book constitutes an introduction to distributed computing and is suitable for advanced undergraduate or graduate students in computer science and computer engineering, graduate students in mathematics interested in distributed computing, and practitioners and engineers involved in the design and implementation of distributed applications.
Abstract: Distributed computing is at the heart of many applications. It arises as soon as one has to solve a problem in terms of entities -- such as processes, peers, processors, nodes, or agents -- that individually have only a partial knowledge of the many input parameters associated with the problem. In particular each entity cooperating towards the common goal cannot have an instantaneous knowledge of the current state of the other entities. Whereas parallel computing is mainly concerned with 'efficiency', and real-time computing is mainly concerned with 'on-time computing', distributed computing is mainly concerned with 'mastering uncertainty' created by issues such as the multiplicity of control flows, asynchronous communication, unstable behaviors, mobility, and dynamicity. While some distributed algorithms consist of a few lines only, their behavior can be difficult to understand and their properties hard to state and prove. The aim of this book is to present in a comprehensive way the basic notions, concepts, and algorithms of distributed computing when the distributed entities cooperate by sending and receiving messages on top of an asynchronous network. The book is composed of seventeen chapters structured into six parts: distributed graph algorithms, in particular what makes them different from sequential or parallel algorithms; logical time and global states, the core of the book; mutual exclusion and resource allocation; high-level communication abstractions; distributed detection of properties; and distributed shared memory. The author establishes clear objectives per chapter and the content is supported throughout with illustrative examples, summaries, exercises, and annotated bibliographies. 
This book constitutes an introduction to distributed computing and is suitable for advanced undergraduate students or graduate students in computer science and computer engineering, graduate students in mathematics interested in distributed computing, and practitioners and engineers involved in the design and implementation of distributed applications. The reader should have a basic knowledge of algorithms and operating systems.

161 citations


Posted Content
TL;DR: A general algorithmic framework is developed that, besides MST, also applies to Earth-Mover Distance (EMD) and the transportation cost problem, with implications beyond the MapReduce model.
Abstract: We give algorithms for geometric graph problems in the modern parallel models inspired by MapReduce. For example, for the Minimum Spanning Tree (MST) problem over a set of points in the two-dimensional space, our algorithm computes a $(1+\epsilon)$-approximate MST. Our algorithms work in a constant number of rounds of communication, while using total space and communication proportional to the size of the data (linear space and near linear time algorithms). In contrast, for general graphs, achieving the same result for MST (or even connectivity) remains a challenging open problem, despite drawing significant attention in recent years. We develop a general algorithmic framework that, besides MST, also applies to Earth-Mover Distance (EMD) and the transportation cost problem. Our algorithmic framework has implications beyond the MapReduce model. For example it yields a new algorithm for computing EMD cost in the plane in near-linear time, $n^{1+o_\epsilon(1)}$. We note that while recently Sharathkumar and Agarwal developed a near-linear time algorithm for $(1+\epsilon)$-approximating EMD, our algorithm is fundamentally different, and, for example, also solves the transportation (cost) problem, raised as an open question in their work. Furthermore, our algorithm immediately gives a $(1+\epsilon)$-approximation algorithm with $n^{\delta}$ space in the streaming-with-sorting model with $1/\delta^{O(1)}$ passes. As such, it is tempting to conjecture that the parallel models may also constitute a concrete playground in the quest for efficient algorithms for EMD (and other similar problems) in the vanilla streaming model, a well-known open problem.

135 citations


Proceedings ArticleDOI
20 May 2013
TL;DR: This work obtains the first communication-optimal algorithm for all dimensions of rectangular matrices by combining the dimension-splitting technique with the recursive BFS/DFS approach, and shows significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
Abstract: Communication-optimal algorithms are known for square matrix multiplication. Here, we obtain the first communication-optimal algorithm for all dimensions of rectangular matrices. Combining the dimension-splitting technique of Frigo, Leiserson, Prokop and Ramachandran (1999) with the recursive BFS/DFS approach of Ballard, Demmel, Holtz, Lipshitz and Schwartz (2012) allows for a communication-optimal as well as cache and network-oblivious algorithm. Moreover, the implementation is simple: approximately 50 lines of code for the shared-memory version. Since the new algorithm minimizes communication across the network, between NUMA domains, and between levels of cache, it performs well in practice on both shared and distributed-memory machines. We show significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
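The dimension-splitting recursion can be sketched serially. This is a pure-Python toy illustrating the recursive structure only; the base-case size is arbitrary, and no communication, cache behavior, or parallelism is modeled:

```python
# Recursive matrix multiplication that always splits the largest of the three
# dimensions (m, k, n) in half -- the dimension-splitting idea behind
# communication-optimal rectangular multiplication, shown serially.

def naive(A, B):
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(n)]
            for i in range(m)]

def rec_mm(A, B, base=2):
    m, k, n = len(A), len(B), len(B[0])
    if max(m, k, n) <= base:
        return naive(A, B)
    if m >= k and m >= n:                     # split rows of A
        h = m // 2
        return rec_mm(A[:h], B, base) + rec_mm(A[h:], B, base)
    if n >= k:                                # split columns of B
        h = n // 2
        L = rec_mm(A, [row[:h] for row in B], base)
        R = rec_mm(A, [row[h:] for row in B], base)
        return [l + r for l, r in zip(L, R)]
    h = k // 2                                # split the inner dimension, add results
    C1 = rec_mm([row[:h] for row in A], B[:h], base)
    C2 = rec_mm([row[h:] for row in A], B[h:], base)
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(C1, C2)]

A = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [2, 0, 1, 3, 5]]
B = [[1, 0, 2], [0, 1, 1], [3, 2, 0], [1, 1, 1], [2, 0, 4]]
C = rec_mm(A, B)
```

Splitting m or n yields two independent subproblems (a parallel fork in the BFS/DFS scheme), while splitting k yields two subproblems whose results must be summed, which is where the communication trade-off arises.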

132 citations


Posted Content
TL;DR: NOMAD (Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion) is proposed, in which ownership of variables is transferred asynchronously between processors in a decentralized fashion, making it a lock-free parallel algorithm.
Abstract: We develop an efficient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion). NOMAD is a decentralized algorithm with non-blocking communication between processors. One of the key features of NOMAD is that the ownership of a variable is asynchronously transferred between processors in a decentralized fashion. As a consequence it is a lock-free parallel algorithm. In spite of being an asynchronous algorithm, the variable updates of NOMAD are serializable, that is, there is an equivalent update ordering in a serial implementation. NOMAD outperforms synchronous algorithms which require explicit bulk synchronization after every iteration: our extensive empirical evaluation shows that not only does our algorithm perform well in distributed setting on commodity hardware, but also outperforms state-of-the-art algorithms on a HPC cluster both in multi-core and distributed memory settings.

128 citations


Journal ArticleDOI
TL;DR: Robustness to random variations in electricity price and renewable generation is effected through robust optimization techniques, and a real-time extension is also discussed.
Abstract: A demand response (DR) problem is considered entailing a set of devices/subscribers, whose operating conditions are modeled using mixed-integer constraints. Device operational periods and power consumption levels are optimized in response to dynamic pricing information to balance user satisfaction and energy cost. Renewable energy resources and energy storage systems are also incorporated. Since DR becomes more effective as the number of participants grows, scalability is ensured through a parallel distributed algorithm, in which a DR coordinator and DR subscribers solve individual subproblems, guided by certain coordination signals. As the problem scales, the recovered solution becomes near-optimal. Robustness to random variations in electricity price and renewable generation is effected through robust optimization techniques. Real-time extension is also discussed. Numerical tests validate the proposed approach.

Proceedings ArticleDOI
27 Oct 2013
TL;DR: An efficient MPI-based distributed memory parallel algorithm, called PATRIC, for counting triangles in massive networks, which scales well to networks with billions of nodes and can compute the exact number of triangles in a network with one billion nodes and 10 billion edges in 16 minutes.
Abstract: Massive networks arising in numerous application areas pose significant challenges for network analysts, as these networks grow to billions of nodes and are too large to fit in main memory. Finding the number of triangles in a network is an important problem in the analysis of complex networks, and several interesting graph mining applications depend on it. In this paper, we present an efficient MPI-based distributed-memory parallel algorithm, called PATRIC, for counting triangles in massive networks. PATRIC scales well to networks with billions of nodes and can compute the exact number of triangles in a network with one billion nodes and 10 billion edges in 16 minutes. Balancing computational load among processors for a graph problem like triangle counting is a challenging issue. We present and analyze several schemes for balancing load among processors for the triangle counting problem; these schemes achieve very good load balancing. We also show how our parallel algorithm can adapt an existing edge sparsification technique to approximate the number of triangles with very high accuracy. This modification allows us to count triangles in even larger networks.
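The per-node counting kernel that a distributed algorithm like PATRIC partitions across processors can be sketched serially (the standard ordered-neighbor formulation; the MPI partitioning, load-balancing schemes, and sparsification are omitted):

```python
# Serial triangle-counting kernel: for each edge (u, v) with u < v, count
# common neighbors w with w > v, so every triangle is counted exactly once.

def count_triangles(adj):
    total = 0
    for u in adj:
        higher_u = {w for w in adj[u] if w > u}      # neighbors above u
        for v in higher_u:
            higher_v = {w for w in adj[v] if w > v}  # neighbors above v
            total += len(higher_u & higher_v)        # each w closes a triangle u<v<w
    return total

# K4: every one of the C(4,3) = 4 node triples forms a triangle.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
t = count_triangles(k4)
```

The load-balancing difficulty the paper analyzes is visible in this kernel: the work per node is driven by neighborhood sizes, which are highly skewed in real networks.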

Journal ArticleDOI
TL;DR: This work designs and implements efficient parallel community detection heuristics: a parallel label propagation scheme; the first large-scale parallelization of the well-known Louvain method, as well as an extension of the method adding refinement; and an ensemble scheme combining the above.
Abstract: The amount of graph-structured data has recently experienced an enormous growth in many applications. To transform such data into useful information, fast analytics algorithms and software tools are necessary. One common graph analytics kernel is disjoint community detection (or graph clustering). Despite extensive research on heuristic solvers for this task, only few parallel codes exist, although parallelism will be necessary to scale to the data volume of real-world applications. We address the deficit in computing capability by a flexible and extensible community detection framework with shared-memory parallelism. Within this framework we design and implement efficient parallel community detection heuristics: A parallel label propagation scheme; the first large-scale parallelization of the well-known Louvain method, as well as an extension of the method adding refinement; and an ensemble scheme combining the above. In extensive experiments driven by the algorithm engineering paradigm, we identify the most successful parameters and combinations of these algorithms. We also compare our implementations with state-of-the-art competitors. The processing rate of our fastest algorithm often reaches 50M edges/second. We recommend the parallel Louvain method and our variant with refinement as both qualitatively strong and fast. Our methods are suitable for massive data sets with billions of edges.
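The label propagation scheme can be sketched in its serial form. This is a toy version; the deterministic processing order and the largest-label tie-break are illustrative assumptions (parallel implementations typically randomize both):

```python
from collections import Counter

# Asynchronous label propagation sketch: each node repeatedly adopts the most
# frequent label among its neighbors (ties broken toward the largest label)
# until no label changes. This is the serial core of the parallel scheme.

def label_propagation(adj, max_rounds=100):
    labels = {u: u for u in adj}          # every node starts in its own community
    for _ in range(max_rounds):
        changed = False
        for u in sorted(adj):
            cnt = Counter(labels[v] for v in adj[u])
            best = max(cnt.values())
            new = max(l for l, c in cnt.items() if c == best)
            if new != labels[u]:
                labels[u] = new
                changed = True
        if not changed:
            break
    return labels

# Two 5-cliques joined by a single bridge edge (4, 5).
adj = {u: set() for u in range(10)}
for grp in (range(0, 5), range(5, 10)):
    for a in grp:
        for b in grp:
            if a != b:
                adj[a].add(b)
adj[4].add(5)
adj[5].add(4)
labels = label_propagation(adj)
```

Each node's update depends only on its neighbors' current labels, which is why the scheme parallelizes well over shared memory.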

Proceedings ArticleDOI
23 Jul 2013
TL;DR: Two new parallel algorithms are obtained and it is proved that they match the expected communication cost lower bound, and hence they are optimal.
Abstract: Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdős–Rényi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.

01 Jan 2013
TL;DR: Three modified k-means algorithms are discussed that remove limitations of the standard k-means algorithm and improve its speed and efficiency.
Abstract: Cluster analysis is a descriptive task that seeks to identify homogeneous groups of objects, and it is one of the main analytical methods in data mining. K-means is the most popular partitional clustering method. In this paper we discuss the standard k-means algorithm and analyze its shortcomings, and then discuss three modified k-means algorithms that remove limitations of the standard algorithm and improve its speed and efficiency. The first algorithm removes the requirement of specifying the value of k in advance, which is difficult in practice, and yields an optimal number of clusters. The second algorithm reduces computational complexity and removes the dead-unit problem by selecting the most populated area as the cluster center. The third algorithm uses a simple data structure to store information from each iteration for reuse in the next, increasing clustering speed and reducing time complexity.
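For reference, the standard Lloyd iteration that these variants modify can be sketched as follows (a minimal 2-D version; the first-k-points initialization is a simplification of the initialization problem the first variant addresses):

```python
# Standard k-means (Lloyd's algorithm) in 2-D: assign each point to its
# nearest center, then move each center to the mean of its assigned points.
# The modified variants above change initialization, center selection, or
# per-iteration bookkeeping; this is the baseline they start from.

def kmeans(points, k, iters=20):
    centers = [points[i] for i in range(k)]   # naive init: first k points
    assign = [0] * len(points)
    for _ in range(iters):
        for idx, (x, y) in enumerate(points):
            assign[idx] = min(range(k),
                              key=lambda c: (x - centers[c][0]) ** 2 +
                                            (y - centers[c][1]) ** 2)
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:                       # guard against empty clusters
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return assign, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assign, centers = kmeans(points, 2)
```

The third variant's speedup comes from caching, for each point, its current cluster and distance between iterations so unchanged assignments can be skipped; the sketch above recomputes everything each pass.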

Journal ArticleDOI
TL;DR: A comparative experimental study highlights the performance impact of ACO parameters, GPU technical configuration, memory structures and parallelization granularity on a state-of-the-art Fermi GPU architecture.

Journal ArticleDOI
TL;DR: A novel parallel algorithm for the integration of linear initial-value problems is proposed, based on the simple observation that homogeneous problems can typically be integrated much faster than inhomogeneous problems.
Abstract: A novel parallel algorithm for the integration of linear initial-value problems is proposed. This algorithm is based on the simple observation that homogeneous problems can typically be integrated much faster than inhomogeneous problems. An overlapping time-domain decomposition is utilized to obtain decoupled inhomogeneous and homogeneous subproblems, and a near-optimal Krylov method is used for the fast exponential integration of the homogeneous subproblems. We present an error analysis and discuss the parallel scaling of our algorithm. The efficiency of this approach is demonstrated with numerical examples.

Proceedings ArticleDOI
21 Jul 2013
TL;DR: This paper focuses on the development of a decomposition scheme based on the progressive hedging algorithm of Rockafellar and Wets, and makes use of modest-scale parallel computing, representing capabilities either presently deployed, or easily deployed in the near future.
Abstract: Given increasing penetration of variable generation units, there is significant interest in the power systems research community concerning the development of solution techniques that directly address the stochasticity of these sources in the unit commitment problem. Unfortunately, despite significant attention from the research community, stochastic unit commitment solvers have not made their way into practice, due in large part to the computational difficulty of the problem. In this paper, we address this issue, and focus on the development of a decomposition scheme based on the progressive hedging algorithm of Rockafellar and Wets. Our focus is on achieving solve times that are consistent with the requirements of ISO and utilities, on modest-scale instances, using reasonable numbers of scenarios. Further, we make use of modest-scale parallel computing, representing capabilities either presently deployed, or easily deployed in the near future. We demonstrate our progress to date on a test instance representing a simplified version of the US western interconnect (WECC-240).

Proceedings ArticleDOI
08 Apr 2013
TL;DR: In this article, the authors propose to find all instances of a given sample graph in a larger data graph using a single round of map-reduce, using the techniques of multiway joins.
Abstract: The theme of this paper is how to find all instances of a given “sample” graph in a larger “data graph,” using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of [1] for computing multiway joins (evaluating conjunctive queries) in a single map-reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be “convertible,” in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.

Posted Content
TL;DR: In this paper, a parallel and distributed extension to the alternating direction method of multipliers (ADMM) for solving convex problems is introduced, which decomposes the original problem into N smaller subproblems and solves them in parallel at each iteration.
Abstract: This paper introduces a parallel and distributed extension to the alternating direction method of multipliers (ADMM) for solving the convex problem: minimize $\sum_{i=1}^N f_i(x_i)$ subject to $\sum_{i=1}^N A_i x_i=c, x_i\in \mathcal{X}_i$. The algorithm decomposes the original problem into N smaller subproblems and solves them in parallel at each iteration. This Jacobian-type algorithm is well suited for distributed computing and is particularly attractive for solving certain large-scale problems. This paper introduces a few novel results. Firstly, it shows that extending ADMM straightforwardly from the classic Gauss-Seidel setting to the Jacobian setting, from 2 blocks to N blocks, will preserve convergence if matrices $A_i$ are mutually near-orthogonal and have full column-rank. Secondly, for general matrices $A_i$, this paper proposes to add proximal terms of different kinds to the N subproblems so that the subproblems can be solved in flexible and efficient ways and the algorithm converges globally at a rate of o(1/k). Thirdly, a simple technique is introduced to improve some existing convergence rates from O(1/k) to o(1/k). In practice, some conditions in our convergence theorems are conservative. Therefore, we introduce a strategy for dynamically tuning the parameters in the algorithm, leading to substantial acceleration of the convergence in practice. Numerical results are presented to demonstrate the efficiency of the proposed method in comparison with several existing parallel algorithms. We implemented our algorithm on Amazon EC2, an on-demand public computing cloud, and report its performance on very large-scale basis pursuit problems with distributed data.

Journal ArticleDOI
TL;DR: Several techniques are introduced to do optimization on GPUs, including reducing global memory transactions of input buffer, reducing latency of transition table lookup, eliminating output table accesses, avoiding bank-conflict of shared memory, coalescing writes to global memory, and enhancing data transmission via peripheral component interconnect express.
Abstract: Graphics processing units (GPUs) have attracted a lot of attention due to their cost-effective and enormous power for massive data parallel computing. In this paper, we propose a novel parallel algorithm for exact pattern matching on GPUs. A traditional exact pattern matching algorithm matches multiple patterns simultaneously by traversing a special state machine called an Aho-Corasick machine. Considering the particular parallel architecture of GPUs, in this paper, we first propose an efficient state machine on which we perform very efficient parallel algorithms. Also, several techniques are introduced to do optimization on GPUs, including reducing global memory transactions of input buffer, reducing latency of transition table lookup, eliminating output table accesses, avoiding bank-conflict of shared memory, coalescing writes to global memory, and enhancing data transmission via peripheral component interconnect express. We evaluate the performance of the proposed algorithm using attack patterns from Snort V2.8 and input streams from DEFCON. The experimental results show that the proposed algorithm performed on NVIDIA GPUs achieves up to 143.16-Gbps throughput, 14.74 times faster than the Aho-Corasick algorithm implemented on a 3.06-GHz quad-core CPU with OpenMP. The library of the proposed algorithm is publicly accessible through Google Code.
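The serial Aho-Corasick machine that the GPU kernels traverse can be sketched as follows (a plain dictionary-based automaton; the GPU-friendly state-machine layout and memory optimizations the paper proposes are not shown):

```python
from collections import deque

# Minimal Aho-Corasick automaton: a trie over the patterns plus BFS-built
# failure links, so all patterns are matched in one pass over the text.

def build(patterns):
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:                      # build the trie
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    q = deque(goto[0].values())               # depth-1 states fail to the root
    while q:                                  # BFS to set deeper failure links
        u = q.popleft()
        for ch, v in goto[u].items():
            q.append(v)
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f][ch] if ch in goto[f] and goto[f][ch] != v else 0
            out[v] += out[fail[v]]            # inherit matches from the fail state
    return goto, fail, out

def search(text, goto, fail, out):
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:        # follow failure links on mismatch
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits += [(i - len(p) + 1, p) for p in out[s]]
    return hits

machine = build(["he", "she", "his", "hers"])
hits = search("ushers", *machine)
```

On a GPU, many threads run this search loop on different chunks of the input stream against a shared transition table, which is why the table's memory layout dominates performance.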

Proceedings ArticleDOI
17 Nov 2013
TL;DR: This paper investigates the shortcomings of the conventional approach in parallel SCC detection and proposes a series of extensions that consider the fundamental properties of real-world graphs, e.g. the small-world property.
Abstract: Detecting strongly connected components (SCCs) in a directed graph is a fundamental graph analysis algorithm that is used in many science and engineering domains. Traditional approaches in parallel SCC detection, however, show limited performance and poor scaling behavior when applied to large real-world graph instances. In this paper, we investigate the shortcomings of the conventional approach and propose a series of extensions that consider the fundamental properties of real-world graphs, e.g. the small-world property. Our scalable implementation offers excellent performance on diverse, small-world graphs resulting in a 5.01x to 29.41x parallel speedup over the optimal sequential algorithm with 16 cores and 32 hardware threads.
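The conventional forward-backward (FW-BW) decomposition that such work starts from can be sketched serially (recursion on three partitions; the small-world extensions the paper proposes, such as trimming and pivot-selection heuristics, are not shown):

```python
# Serial forward-backward SCC sketch: the SCC of a pivot is the intersection
# of its forward- and backward-reachable sets; the three remaining partitions
# are independent and, in parallel implementations, processed concurrently.

def reach(start, nodes, edges):
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in edges.get(u, ()):
            if v in nodes and v not in seen:  # stay inside the current partition
                seen.add(v)
                stack.append(v)
    return seen

def fwbw_scc(nodes, fwd, bwd):
    if not nodes:
        return []
    pivot = next(iter(nodes))
    F = reach(pivot, nodes, fwd)              # forward-reachable from pivot
    B = reach(pivot, nodes, bwd)              # backward-reachable from pivot
    scc = F & B                               # the pivot's strongly connected component
    return ([scc]
            + fwbw_scc(F - scc, fwd, bwd)
            + fwbw_scc(B - scc, fwd, bwd)
            + fwbw_scc(nodes - F - B, fwd, bwd))

edges = {0: [1], 1: [2], 2: [0, 3], 3: [4], 4: [3, 5]}
fwd = {u: set(vs) for u, vs in edges.items()}
bwd = {}
for u, vs in edges.items():
    for v in vs:
        bwd.setdefault(v, set()).add(u)
sccs = {frozenset(c) for c in fwbw_scc(set(range(6)), fwd, bwd)}
```

On small-world graphs a single giant SCC absorbs most nodes in the first step, leaving many tiny partitions; handling that skew is exactly where the conventional approach loses scalability.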

Journal ArticleDOI
TL;DR: A general primal-dual splitting algorithm for solving systems of structured coupled monotone inclusions in Hilbert spaces is introduced and its asymptotic behavior is analyzed, providing a flexible solution method applicable to a variety of problems beyond the reach of the state-of-the-art.
Abstract: A general primal-dual splitting algorithm for solving systems of structured coupled monotone inclusions in Hilbert spaces is introduced and its asymptotic behavior is analyzed. Each inclusion in the primal system features compositions with linear operators, parallel sums, and Lipschitzian operators. All the operators involved in this structured model are used separately in the proposed algorithm, most steps of which can be executed in parallel. This provides a flexible solution method applicable to a variety of problems beyond the reach of the state-of-the-art. Several applications are discussed to illustrate this point.

Proceedings ArticleDOI
17 Jun 2013
TL;DR: An intuitive performance model for cache-coherent architectures is developed and used to develop several optimal and optimized algorithms for complex parallel data exchanges that beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries.
Abstract: Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication is performed implicitly by cache line transfers between cores, complicating the understanding of performance properties. We developed an intuitive performance model for cache-coherent architectures and demonstrate its use with the currently most scalable cache-coherent many-core architecture, Intel Xeon Phi. Using our model, we develop several optimal and optimized algorithms for complex parallel data exchanges. All algorithms that were developed with the model beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries by up to a factor of 4.3. The model can be simplified to satisfy the tradeoff between complexity of algorithm design and accuracy. We expect that our model can serve as a vehicle for advanced algorithm design.

Proceedings ArticleDOI
Weina Wang1, Kai Zhu1, Lei Ying1, Jian Tan2, Li Zhang2 
14 Apr 2013
TL;DR: A new queueing architecture is presented and a map task scheduling algorithm constituted by the Join the Shortest Queue policy together with the MaxWeight policy is proposed that is heavy-traffic optimal, i.e., it asymptotically minimizes the number of backlogged tasks as the arrival rate vector approaches the boundary of the capacity region.
Abstract: Scheduling map tasks to improve data locality is crucial to the performance of MapReduce. Many works have been devoted to increasing data locality for better efficiency. However, to the best of our knowledge, fundamental limits of MapReduce computing clusters with data locality, including the capacity region and theoretical bounds on the delay performance, have not been studied. In this paper, we address these problems from a stochastic network perspective. Our focus is to strike the right balance between data-locality and load-balancing to simultaneously maximize throughput and minimize delay. We present a new queueing architecture and propose a map task scheduling algorithm constituted by the Join the Shortest Queue policy together with the MaxWeight policy. We identify an outer bound on the capacity region, and then prove that the proposed algorithm stabilizes any arrival rate vector strictly within this outer bound. It shows that the algorithm is throughput optimal and the outer bound coincides with the actual capacity region. Further, we study the number of backlogged tasks under the proposed algorithm, which is directly related to the delay performance based on Little's law. We prove that the proposed algorithm is heavy-traffic optimal, i.e., it asymptotically minimizes the number of backlogged tasks as the arrival rate vector approaches the boundary of the capacity region. Therefore, the proposed algorithm is also delay optimal in the heavy-traffic regime.
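The two policies the paper combines are classical queueing-theory building blocks. The sketch below shows each in isolation; the weights are illustrative stand-ins for the locality-dependent service rates of the actual model:

```python
def join_shortest_queue(queues, task):
    """JSQ: route an arriving task to the queue with the smallest backlog."""
    min(queues, key=len).append(task)

def max_weight_pick(queues, weights):
    """MaxWeight: serve the queue maximizing backlog x weight.
    In the paper's setting the weight would reflect data locality,
    e.g. a higher service rate for tasks with local data."""
    return max(range(len(queues)), key=lambda i: len(queues[i]) * weights[i])

# JSQ keeps backlogs balanced across servers.
queues = [[] for _ in range(4)]
for t in range(10):
    join_shortest_queue(queues, t)
print([len(q) for q in queues])  # backlogs differ by at most one
```

The paper's contribution is not these policies themselves but the queueing architecture that lets their combination achieve both throughput optimality and heavy-traffic delay optimality.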

Journal ArticleDOI
Sujatha R. Upadhyaya1
TL;DR: MapReduce is another important technique that has evolved during this period; the literature shows it to be an important aid in delivering the performance of machine learning algorithms on GPUs.

Proceedings ArticleDOI
23 Feb 2013
TL;DR: StreamScan is a novel approach to implementing scan on GPUs with only one computation phase; the main idea is to restrict synchronization to adjacent workgroups, thereby eliminating global barrier synchronization completely.
Abstract: Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. The current state of the art in GPU-based scan implementations consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implement scan on GPUs with only one computation phase. The main idea is to restrict synchronization to only adjacent workgroups, thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important optimizations to further boost performance speedups, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on-chip problem size. We designed an auto-tuning framework to search the parameter space automatically to generate highly optimized codes for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatically different optimization tradeoffs between Nvidia and AMD GPU platforms.
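The single-pass idea can be emulated sequentially: each "workgroup" scans its own chunk and needs only the running total published by its left neighbor, which is the one synchronization point the real GPU kernel keeps. This is a sketch of the data flow, not the OpenCL implementation:

```python
def stream_scan(data, group_size):
    """Emulate StreamScan-style single-pass exclusive prefix sum:
    each workgroup scans its chunk locally, then offsets it by the
    inclusive total carried over from the previous workgroup."""
    result = []
    carry = 0  # total published by the preceding workgroup
    for start in range(0, len(data), group_size):
        chunk = data[start:start + group_size]
        acc = 0
        for x in chunk:  # local exclusive scan within the workgroup
            result.append(carry + acc)
            acc += x
        carry += acc  # the value the next workgroup waits for
    return result

print(stream_scan([1, 2, 3, 4, 5], 2))  # exclusive scan: [0, 1, 3, 6, 10]
```

On a GPU the chunks run concurrently and each workgroup spins only until its left neighbor's `carry` is visible, which is how the scheme reads each input element once (2N accesses: N reads, N writes) instead of the 3N of Reduce-Scan-Scan.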

Journal ArticleDOI
TL;DR: A parallel coordinate descent algorithm for solving smooth convex optimization problems with separable constraints that may arise, e.g. in distributed model predictive control (MPC) for linear network systems, which has low iteration complexity and is suitable for distributed implementations.

Journal ArticleDOI
TL;DR: A massively parallel method that obeys detailed balance and is able to calculate the equation of state for systems of up to one million hard disks; the thermodynamics of hard disks is discussed separately in a companion paper.

Posted Content
TL;DR: In this paper, a parallel coordinate descent algorithm for solving smooth convex optimization problems with separable constraints is proposed, which is suitable for distributed implementations and has low iteration complexity, which makes it appropriate for embedded control.
Abstract: In this paper we propose a parallel coordinate descent algorithm for solving smooth convex optimization problems with separable constraints that may arise, e.g., in distributed model predictive control (MPC) for linear network systems. Our algorithm is based on block coordinate descent updates performed in parallel and has a very simple iteration. We prove (sub)linear rate of convergence for the new algorithm under standard assumptions for smooth convex optimization. Further, our algorithm uses local information and thus is suitable for distributed implementations. Moreover, it has low iteration complexity, which makes it appropriate for embedded control. An MPC scheme based on this new parallel algorithm is derived, for which every subsystem in the network can compute feasible and stabilizing control inputs using distributed and cheap computations. For ensuring stability of the MPC scheme, we use a terminal cost formulation derived from a distributed synthesis. Preliminary numerical tests show better performance for our optimization algorithm than for other existing methods.
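The flavor of such a scheme, in the extreme case where every coordinate updates simultaneously from its own local gradient information, can be sketched as follows. The objective and step size are illustrative choices, not taken from the paper:

```python
def parallel_coordinate_descent(grad, x, step, iters):
    """Fully parallel coordinate updates: at each iteration every
    coordinate takes a gradient step at once, using only its own
    component of the gradient (the 'local information')."""
    for _ in range(iters):
        g = grad(x)  # each g[i] could be computed by subsystem i
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Separable toy objective: f(x) = sum_i (x_i - i)^2, minimized at x_i = i.
grad = lambda x: [2 * (xi - i) for i, xi in enumerate(x)]
x = parallel_coordinate_descent(grad, [0.0, 0.0, 0.0], 0.25, 100)
print(x)  # converges toward [0, 1, 2]
```

Because the objective is separable, the coordinate updates do not interfere and the iteration contracts toward the minimizer; for coupled objectives, as in the networked MPC setting, the step size must account for the coupling between blocks to retain convergence.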