Lluís-Miquel Munguía
Researcher at Georgia Institute of Technology
Publications - 9
Citations - 236
Lluís-Miquel Munguía is an academic researcher from Georgia Institute of Technology. The author has contributed to research in topics: Parallel algorithm & Solver. The author has an h-index of 7, co-authored 9 publications receiving 202 citations. Previous affiliations of Lluís-Miquel Munguía include Polytechnic University of Catalonia.
Papers
Proceedings ArticleDOI
Fast triangle counting on the GPU
TL;DR: This paper presents the first scalable GPU implementation of triangle counting, based on a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm) that exposes two levels of parallelism.
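The core operation behind triangle counting is intersecting the sorted neighbor lists of the two endpoints of each edge; every common neighbor closes a triangle. The sketch below shows this merge-style intersection serially, under the assumption of a sorted adjacency-list representation; the paper's Intersect Path algorithm parallelizes the same intersection on the GPU, which this sketch does not attempt.

```python
def count_triangles(adj):
    """Count triangles in an undirected graph.

    adj: dict mapping each vertex to a sorted list of its neighbors.
    For each edge (u, v) with u < v, a merge-style scan of the two
    sorted neighbor lists finds common neighbors w, each closing a
    triangle {u, v, w}. Every triangle is found once per edge, i.e.
    three times in total, hence the final division by 3.
    """
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if v <= u:
                continue  # visit each undirected edge once, with u < v
            a, b = adj[u], adj[v]
            i = j = 0
            while i < len(a) and j < len(b):
                if a[i] == b[j]:
                    total += 1
                    i += 1
                    j += 1
                elif a[i] < b[j]:
                    i += 1
                else:
                    j += 1
    return total // 3
```

A graph stored as a single triangle yields 1, and the complete graph K4 yields 4.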
Posted Content
Carbon Emissions and Large Neural Network Training.
David A. Patterson, Joseph E. Gonzalez, Quoc V. Le, Chen Liang, Lluís-Miquel Munguía, Daniel Rothchild, David R. So, Maud Texier, Jeffrey Dean +8 more
TL;DR: In this article, the authors calculate the energy use and carbon footprint of several recent large models, including T5, Meena, GShard, Switch Transformer, and GPT-3, and refine earlier estimates for the neural architecture search that found the Evolved Transformer.
Journal ArticleDOI
Alternating criteria search: a parallel large neighborhood search algorithm for mixed integer programs
TL;DR: This paper presents a parallel large neighborhood search framework for finding high-quality primal solutions to general mixed-integer programs (MIPs), with the dual objective of reducing infeasibility and optimizing with respect to the original objective.
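The "dual objective" idea can be illustrated by a toy alternating local search: when the incumbent is infeasible, moves reduce the constraint violation; when it is feasible, moves improve the original objective. This is only a loose, serial illustration of alternating criteria on a binary knapsack-style problem, not the paper's parallel large neighborhood search for MIPs.

```python
def alternating_search(c, a, b, iters=100):
    """Toy alternating-criteria local search: maximize c.x subject to
    a.x <= b over binary x, alternating between an infeasibility-reduction
    phase and an objective-improvement phase."""
    n = len(c)
    x = [1] * n  # start from an (often infeasible) all-ones solution
    for _ in range(iters):
        load = sum(ai * xi for ai, xi in zip(a, x))
        if load > b:
            # Infeasibility phase: drop the selected item freeing the most capacity.
            i = max((i for i in range(n) if x[i]), key=lambda i: a[i])
            x[i] = 0
        else:
            # Objective phase: add the most valuable item that stays feasible.
            cand = [i for i in range(n) if not x[i] and load + a[i] <= b]
            if not cand:
                break  # no improving feasible move remains
            x[max(cand, key=lambda i: c[i])] = 1
    return x
```

With values c = [10, 6, 4], weights a = [5, 4, 3], and capacity b = 7, the search first restores feasibility and then fills the remaining capacity, returning a feasible solution of value 10.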
Proceedings ArticleDOI
Load balanced clustering coefficients
TL;DR: This work shows two scalable approaches to load-balancing the computation of clustering coefficients: one achieves optimal load balance with an O(|E|) storage requirement, while the other lowers the storage requirement at the cost of some imbalance.
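For context, the quantity being load-balanced is the local clustering coefficient: for each vertex v, the number of edges among v's neighbors divided by the number of possible neighbor pairs, deg(v)·(deg(v)−1)/2. The serial sketch below computes that quantity only; the paper's contribution is distributing this per-vertex work evenly, which a naive vertex partition fails to do on skewed degree distributions.

```python
def clustering_coefficients(adj):
    """Local clustering coefficient for each vertex of an undirected graph.

    adj: dict mapping each vertex to a list of its neighbors.
    cc[v] = (edges among neighbors of v) / (deg(v) choose 2).
    """
    cc = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            cc[v] = 0.0  # fewer than two neighbors: no possible triangle
            continue
        nbr_set = set(nbrs)
        # Each edge (u, w) among v's neighbors closes one triangle through v;
        # the w > u test counts each such edge once.
        links = sum(1 for u in nbrs for w in adj[u]
                    if w in nbr_set and w > u)
        cc[v] = 2.0 * links / (d * (d - 1))
    return cc
```

In a triangle every vertex has coefficient 1.0; the middle vertex of a 3-vertex path has coefficient 0.0.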
Proceedings ArticleDOI
Task-based parallel breadth-first search in heterogeneous environments
TL;DR: This study shows that high processing rates are achievable in hybrid CPU/GPU environments despite GPU communication latency and memory coherence overheads, using a fine-grained task-based parallelization scheme and the OmpSs programming model.
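The traversal that such schemes parallelize is level-synchronous BFS: expand the current frontier in full, then swap in the newly discovered vertices as the next frontier. This serial sketch shows only that structure; in the paper, frontier expansion is split into fine-grained tasks scheduled across CPUs and GPUs via OmpSs, which is not modeled here.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: return the distance (level) of every
    vertex reachable from source.

    adj: dict mapping each vertex to a list of its neighbors.
    Each while-iteration processes one complete frontier; in a
    task-based scheme, the inner loop is the work split into tasks.
    """
    level = {source: 0}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:          # first visit fixes v's level
                    level[v] = level[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier            # barrier between levels
    return level
```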