Home
/
Authors
/
Lluís-Miquel Munguía

Author

Lluís-Miquel Munguía

Other affiliations: Polytechnic University of Catalonia

Bio: Lluís-Miquel Munguía is an academic researcher from Georgia Institute of Technology. The author has contributed to research in topics: Parallel algorithm & Solver. The author has an hindex of 7, co-authored 9 publications receiving 202 citations. Previous affiliations of Lluís-Miquel Munguía include Polytechnic University of Catalonia.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Fast triangle counting on the GPU

[...]

Oded Green, Pavan Yalamanchili, Lluís-Miquel Munguía¹•Institutions (1)

Georgia Institute of Technology¹

16 Nov 2014

TL;DR: This paper shows the first scalable GPU implementation for triangle counting using a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm), which has two levels of parallelism.

...read moreread less

Abstract: Triangle counting in a graph is a building block for clustering coefficients which is a widely used social network analytic for finding key players in a network based on their local connectivity. In this paper we show the first scalable GPU implementation for triangle counting. Our approach uses a new list intersection algorithm called Intersect Path (named after the Merge Path algorithm). This algorithm has two levels of parallelism. The first level partitions the vertices to the streaming multiprocessors on the GPU. The second level is responsible for parallelizing the work across the GPU's streaming processors and utilizing different block sizes. For testing purposes, we used graphs taken from the DIMACS 10 Graph Challenge. Our experiments were conducted on NVIDIA's K40 GPU. Our GPU triangle counting implementation achieves speedups in the range of 9X -- 32X over a CPU sequential implementation.

...read moreread less

73 citations

Posted Content•

Carbon Emissions and Large Neural Network Training.

[...]

David A. Patterson, Joseph E. Gonzalez, Quoc V. Le, Chen Liang, Lluís-Miquel Munguía, Daniel Rothchild, David R. So, Maud Texier, Jeffrey Dean - Show less +5 more

21 Apr 2021-arXiv: Learning

TL;DR: In this article, the authors calculate the energy use and carbon footprint of several recent large models, including T5, Meena, GShard, Switch Transformer, and GPT-3, and refine earlier estimates for the neural architecture search that found evolved transformer.

...read moreread less

Abstract: The computation demand for machine learning (ML) has grown rapidly recently, which comes with a number of costs. Estimating the energy cost helps measure its environmental impact and finding greener strategies, yet it is challenging without detailed information. We calculate the energy use and carbon footprint of several recent large models-T5, Meena, GShard, Switch Transformer, and GPT-3-and refine earlier estimates for the neural architecture search that found Evolved Transformer. We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e): Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy despite using as many or even more parameters. Geographic location matters for ML workload scheduling since the fraction of carbon-free energy and resulting CO2e vary ~5X-10X, even within the same country and the same organization. We are now optimizing where and when large models are trained. Specific datacenter infrastructure matters, as Cloud datacenters can be ~1.4-2X more energy efficient than typical datacenters, and the ML-oriented accelerators inside them can be ~2-5X more effective than off-the-shelf systems. Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint up to ~100-1000X. These large factors also make retroactive estimates of energy cost difficult. To avoid miscalculations, we believe ML papers requiring large computational resources should make energy consumption and CO2e explicit when practical. We are working to be more transparent about energy use and CO2e in our future research. To help reduce the carbon footprint of ML, we believe energy usage and CO2e should be a key metric in evaluating models, and we are collaborating with MLPerf developers to include energy usage during training and inference in this industry standard benchmark.

...read moreread less

54 citations

Journal Article•DOI•

Alternating criteria search: a parallel large neighborhood search algorithm for mixed integer programs

[...]

Lluís-Miquel Munguía¹, Shabbir Ahmed¹, David A. Bader¹, George L. Nemhauser¹, Yufen Shao² - Show less +1 more•Institutions (2)

Georgia Institute of Technology¹, ExxonMobil²

01 Jan 2018-Computational Optimization and Applications

TL;DR: A parallel large neighborhood search framework for finding high quality primal solutions for general mixed-integer programs (MIPs) with the dual objective of reducing infeasibility and optimizing with respect to the original objective.

...read moreread less

Abstract: We present a parallel large neighborhood search framework for finding high quality primal solutions for general mixed-integer programs (MIPs). The approach simultaneously solves a large number of sub-MIPs with the dual objective of reducing infeasibility and optimizing with respect to the original objective. Both goals are achieved by solving restricted versions of two auxiliary MIPs, where subsets of the variables are fixed. In contrast to prior approaches, ours does not require a feasible starting solution. We leverage parallelism to perform multiple searches simultaneously, with the objective of increasing the effectiveness of our heuristic. We computationally compare the proposed framework with a state-of-the-art MIP solver in terms of solution quality, scalability, reproducibility, and parallel efficiency. Results show the efficacy of our approach in finding high quality solutions quickly both as a standalone primal heuristic and when used in conjunction with an exact algorithm.

...read moreread less

23 citations

Proceedings Article•DOI•

Load balanced clustering coefficients

[...]

Oded Green¹, Lluís-Miquel Munguía¹, David A. Bader¹•Institutions (1)

Georgia Institute of Technology¹

16 Feb 2014

TL;DR: This work shows two scalable approaches that load balance clustering coefficients and achieves optimal load balancing with an Ο(|E|) storage requirement and a lower storage requirement at the cost of some imbalance.

...read moreread less

Abstract: Clustering coefficients is a building block in network sciences that offers insights on how tightly bound vertices are in a network. Effective and scalable parallelization of clustering coefficients requires load balancing amongst the cores. This property is not easy to achieve since many real world networks are scale free, which leads to some vertices requiring more attention than others. In this work we show two scalable approaches that load balance clustering coefficients. The first method achieves optimal load balancing with an Ο(|E|) storage requirement. The second method has a lower storage requirement of Ο(|V|) at the cost of some imbalance. While both methods have a similar time complexity, they represent a tradeoff between maintaining a balanced workload and memory complexity. Using a 40-core system we show that our load balancing techniques outperform the widely used and simple parallel approach by a factor of 3X-7.5X for real graphs and 1.5X-4X for random graphs. Further, we achieve 25X-35X speedup over the sequential algorithm for most of the graphs.

...read moreread less

22 citations

Proceedings Article•DOI•

Task-based parallel breadth-first search in heterogeneous environments

[...]

Lluís-Miquel Munguía¹, David A. Bader², Eduard Ayguadé³•Institutions (3)

Polytechnic University of Catalonia¹, Georgia Institute of Technology², Barcelona Supercomputing Center³

01 Dec 2012

TL;DR: This study shows high processing rates are achievable with hybrid environments despite the GPU communication latency and memory coherence, and uses a fine-grained task-based parallelization scheme and the OmpSs programming model to achieve that goal.

...read moreread less

Abstract: Breadth-first search (BFS) is an essential graph traversal strategy widely used in many computing applications. Because of its irregular data access patterns, BFS has become a non-trivial problem hard to parallelize efficiently. In this paper, we introduce a parallelization strategy that allows the load balancing of computation resources as well as the execution of graph traversals in hybrid environments composed of CPUs and GPUs. To achieve that goal, we use a fine-grained task-based parallelization scheme and the OmpSs programming model. We obtain processing rates up to 2.8 billion traversed edges per second with a single GPU and a multi-core processor. Our study shows high processing rates are achievable with hybrid environments despite the GPU communication latency and memory coherence.

...read moreread less

21 citations

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

Ligra: a lightweight graph processing framework for shared memory

[...]

Julian Shun¹, Guy E. Blelloch¹•Institutions (1)

Carnegie Mellon University¹

23 Feb 2013

TL;DR: This paper presents a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.

...read moreread less

Abstract: There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per core, per dollar, and per joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts.In this paper, we present a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.

...read moreread less

816 citations

Using Mpi Portable Parallel Programming With The Message Passing Interface

[...]

Christina Freytag

01 Jan 2016

TL;DR: Thank you very much for downloading using mpi portable parallel programming with the message passing interface for reading a good book with a cup of coffee in the afternoon, instead they are facing with some malicious bugs inside their laptop.

...read moreread less

Abstract: Thank you very much for downloading using mpi portable parallel programming with the message passing interface. As you may know, people have search hundreds times for their chosen novels like this using mpi portable parallel programming with the message passing interface, but end up in harmful downloads. Rather than reading a good book with a cup of coffee in the afternoon, instead they are facing with some malicious bugs inside their laptop.

...read moreread less

593 citations

Journal Article•DOI•

A Survey of CPU-GPU Heterogeneous Computing Techniques

[...]

Sparsh Mittal¹, Jeffrey S. Vetter¹•Institutions (1)

Oak Ridge National Laboratory¹

21 Jul 2015-ACM Computing Surveys

TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.

...read moreread less

Abstract: As both CPUs and GPUs become employed in a wide range of applications, it has been acknowledged that both of these Processing Units (PUs) have their unique features and strengths and hence, CPU-GPU collaboration is inevitable to achieve high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this article, we survey Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency. We review heterogeneous computing approaches at runtime, algorithm, programming, compiler, and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating Heterogeneous Computing Systems (HCSs). We believe that this article will provide insights into the workings and scope of applications of HCTs to researchers and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.

...read moreread less

414 citations

The SCIP Optimization Suite 7.0

[...]

Gerald Gamrath¹, Daniel Anderson², Ksenia Bestuzheva¹, Wei-Kun Chen, Leon Eifler¹, Maxime Gasse³, Patrick Gemander⁴, Ambros M. Gleixner¹, Leona Gottwald¹, Katrin Halbig⁴, Gregor Hendel¹, Christopher Hojny, Thorsten Koch¹, Pierre Le Bodic⁵, Stephen J. Maher⁶, Frederic Matter⁷, Matthias Miltenberger¹, Erik Muhmer⁸, Benjamin Müller¹, Marc E. Pfetsch⁷, Franziska Schlösser¹, Felipe Serrano¹, Yuji Shinano¹, Christine Maher Fouad Tawfik¹, Stefan Vigerske, Fabian Wegscheider¹, Dieter Weninger⁴, Jakob Witzig¹ - Show less +24 more•Institutions (8)

Zuse Institute Berlin¹, Carnegie Mellon University², École Polytechnique de Montréal³, University of Erlangen-Nuremberg⁴, Monash University⁵, University of Exeter⁶, Technische Universität Darmstadt⁷, RWTH Aachen University⁸

30 Mar 2020

TL;DR: New features and enhanced algorithms made available in version 5.0 of the SCIP Optimization Suite, in particular for the LP solver SoPlex, the Steiner tree solver SCIP-Jack, the MISDP solverSCIP-SDP, and the parallelization framework UG are described.

...read moreread less

Abstract: The SCIP Optimization Suite provides a collection of software packages for mathematical optimization centered around the constraint integer programming frame- work SCIP. This paper discusses enhancements and extensions contained in version 7.0 of the SCIP Optimization Suite. The new version features the parallel presolving library PaPILO as a new addition to the suite. PaPILO 1.0 simplifies mixed-integer linear op- timization problems and can be used stand-alone or integrated into SCIP via a presolver plugin. SCIP 7.0 provides additional support for decomposition algorithms. Besides im- provements in the Benders’ decomposition solver of SCIP, user-defined decomposition structures can be read, which are used by the automated Benders’ decomposition solver and two primal heuristics. Additionally, SCIP 7.0 comes with a tree size estimation that is used to predict the completion of the overall solving process and potentially trigger restarts. Moreover, substantial performance improvements of the MIP core were achieved by new developments in presolving, primal heuristics, branching rules, conflict analysis, and symmetry handling. Last, not least, the report presents updates to other components and extensions of the SCIP Optimization Suite, in particular, the LP solver SoPlex and the mixed-integer semidefinite programming solver SCIP-SDP.

...read moreread less

287 citations

Proceedings Article•DOI•

Multicore triangle computations without tuning

[...]

Julian Shun¹, Kanat Tangwongsan²•Institutions (2)

Carnegie Mellon University¹, Mahidol University International College²

13 Apr 2015

TL;DR: This paper describes the design and implementation of simple and fast multicore parallel algorithms for exact, as well as approximate, triangle counting and other triangle computations that scale to billions of nodes and edges, and is much faster than existing parallel approximate triangle counting implementations.

...read moreread less

Abstract: Triangle counting and enumeration has emerged as a basic tool in large-scale network analysis, fueling the development of algorithms that scale to massive graphs. Most of the existing algorithms, however, are designed for the distributed-memory setting or the external-memory setting, and cannot take full advantage of a multicore machine, whose capacity has grown to accommodate even the largest of real-world graphs.

...read moreread less

143 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44

Collapse