Showing papers by "Jesús Labarta" published in 2016


Proceedings ArticleDOI
13 Nov 2016
TL;DR: The Mont-Blanc prototype is presented: the first HPC system built with commodity SoCs, memories, and network interface cards from the embedded and mobile domain, combined with off-the-shelf HPC networking, storage, cooling, and integration solutions.
Abstract: High-performance computing (HPC) is recognized as one of the pillars for further progress in science, industry, medicine, and education. Current HPC systems are being developed to overcome emerging architectural challenges in order to reach Exascale level of performance, projected for the year 2020. The much larger embedded and mobile market allows for rapid development of intellectual property (IP) blocks and provides more flexibility in designing an application-specific system-on-chip (SoC), in turn providing the possibility in balancing performance, energy-efficiency, and cost. In the Mont-Blanc project, we advocate for HPC systems being built from such commodity IP blocks, currently used in embedded and mobile SoCs. As a first demonstrator of such an approach, we present the Mont-Blanc prototype; the first HPC system built with commodity SoCs, memories, and network interface cards (NICs) from the embedded and mobile domain, and off-the-shelf HPC networking, storage, cooling, and integration solutions. We present the system's architecture and evaluate both performance and energy efficiency. Further, we compare the system's abilities against a production level supercomputer. At the end, we discuss parallel scalability and estimate the maximum scalability point of this approach across a set of applications.

57 citations


Journal ArticleDOI
TL;DR: A dynamic load balancing library is used on top of the OpenMP pragmas to continuously exploit all the resources available at the node level, increasing the load balance and the efficiency of the MPI+OpenMP parallelisation.
Abstract: This work presents a parallel numerical strategy to transport Lagrangian particles in a fluid using a dynamic load balance strategy. Both fluid and particle solvers are parallel, with two levels of parallelism. The first level is based on a substructuring technique and uses the message passing interface (MPI) as the communication library; the second level consists of OpenMP pragmas for loop parallelisation at the node level. When dealing with transient flows, there exist two main alternatives to address the coupling of these solvers. On the one hand, a single-code approach consists in solving the particle equations once the fluid solution has been obtained at the end of a time step, using the same instance of the same code. On the other hand, a multi-code approach enables one to overlap the transport of the particles with the next time-step solution of the fluid equations, and thus obtain asynchronism. In this case, different codes or two instances of the same code can be used. Both approaches will be presented. In addition, a dynamic load balancing library is used on top of the OpenMP pragmas in order to continuously exploit all the resources available at the node level, thus increasing the load balance and the efficiency of the MPI+OpenMP parallelisation.
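A minimal sketch of the multi-code (asynchronous) coupling idea described above: the particle transport for step n is overlapped with the fluid solve of step n+1. The fluid_step and transport_particles functions are hypothetical placeholders, not the solver interfaces used in the paper, and a thread pool stands in for the second code instance.

```python
# Minimal sketch of the multi-code (asynchronous) coupling idea: the particle
# transport for step n runs while the fluid solver already advances to step n+1.
# fluid_step() and transport_particles() are hypothetical placeholders, not the
# actual solver interfaces used in the paper.
from concurrent.futures import ThreadPoolExecutor

def fluid_step(state, dt):
    # advance the fluid solution by one time step (placeholder)
    return state + dt

def transport_particles(particles, fluid_state, dt):
    # move Lagrangian particles through the given fluid field (placeholder)
    return [p + dt * fluid_state for p in particles]

def run_coupled(n_steps, dt=0.1):
    fluid, particles = 0.0, [0.0, 1.0, 2.0]
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(n_steps):
            fluid_prev = fluid
            # launch particle transport for the step just computed ...
            pending = pool.submit(transport_particles, particles, fluid_prev, dt)
            # ... while the fluid solver advances to the next time step
            fluid = fluid_step(fluid, dt)
            particles = pending.result()   # synchronise before reusing particles
    return fluid, particles

if __name__ == "__main__":
    print(run_coupled(5))
```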

25 citations


Proceedings ArticleDOI
13 Nov 2016
TL;DR: This paper uses MUSA to simulate up to 16,384 cores and successfully identify scalability bottlenecks due to different factors, e.g. memory contention or load imbalance, showing how MUSA can help system designers to assess the usefulness of future technologies in next-generation HPC machines.
Abstract: The complexity of High Performance Computing (HPC) systems is increasing in the number of components and their heterogeneity. Interactions between software and hardware involve many different aspects which are typically not transparent to scientific programmers and system architects. Therefore, predicting the behavior of current scientific applications on future HPC infrastructures is a challenging task. In this paper we present MUSA, an end-to-end methodology that employs a multi-level simulation infrastructure. By combining different levels of abstraction, MUSA is able to model the communication network, microarchitectural details and system software interactions, providing different trade-offs in terms of simulation cost and accuracy. We compare detailed MUSA simulations with native executions of up to 2,048 cores and find relative errors that are within 10% in the common case. In addition, we use MUSA to simulate up to 16,384 cores and successfully identify scalability bottlenecks due to different factors, e.g. memory contention or load imbalance. We also compare different system configurations, showing how MUSA can help system designers to assess the usefulness of future technologies in next-generation HPC machines.

24 citations


Proceedings Article
01 May 2016
TL;DR: It is shown how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets.
Abstract: This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.

20 citations


Proceedings ArticleDOI
16 May 2016
TL;DR: A low-memory-overhead SDC detector leveraging epsilon-insensitive support vector machine regression to detect SDCs occurring in HPC applications that can be characterized by an impact error bound; it exhibits the best tradeoff between detection ability and overhead among the techniques compared.
Abstract: As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs) or silent errors are one of the major sources that corrupt the execution results of HPC applications without being detected. In this work, we explore a low-memory-overhead SDC detector, by leveraging epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications that can be characterized by an impact error bound. The key contributions are threefold. (1) Our design takes spatial features (i.e., neighbouring data values for each data point in a snapshot) into training data, such that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that our detector achieves a detection sensitivity (i.e., recall) of up to 99% while suffering a false positive rate of less than 1% in most cases. Our detector incurs low performance overhead, 5% on average, for all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff considering the detection ability and overheads.
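The core of the detection scheme lends itself to a small sketch: train an epsilon-insensitive support vector regressor to predict each data point from its spatial neighbours and flag points whose residual exceeds the impact error bound. The synthetic snapshot, the chosen error bound and the choice to train and test on the same snapshot are illustrative simplifications, not the paper's exact procedure.

```python
# Illustrative sketch (not the paper's implementation) of spatial-neighbour SVR
# based SDC detection: each interior point of a 1-D snapshot is predicted from its
# two neighbours; a residual above the impact error bound marks a possible SDC.
import numpy as np
from sklearn.svm import SVR

def build_features(snapshot):
    # spatial features: the left and right neighbour of every interior point
    x = np.column_stack([snapshot[:-2], snapshot[2:]])
    y = snapshot[1:-1]
    return x, y

def detect_sdc(snapshot, error_bound=1e-2, epsilon=1e-3):
    x, y = build_features(snapshot)
    model = SVR(kernel="rbf", epsilon=epsilon).fit(x, y)   # epsilon-insensitive regression
    residual = np.abs(model.predict(x) - y)
    return np.where(residual > error_bound)[0] + 1          # indices of suspect points

if __name__ == "__main__":
    clean = np.sin(np.linspace(0, 3 * np.pi, 200))
    corrupted = clean.copy()
    corrupted[120] += 0.5                                    # inject a silent error
    print("suspect indices:", detect_sdc(corrupted, error_bound=0.05))
```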

18 citations


Proceedings ArticleDOI
23 May 2016
TL;DR: A Criticality Aware Task Acceleration (CATA) mechanism is proposed that dynamically adapts the computational power of a task depending on its criticality, outperforming state-of-the-art acceleration proposals not aware of task criticality.
Abstract: Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigurations, we propose a Criticality Aware Task Acceleration (CATA) mechanism that dynamically adapts the computational power of a task depending on its criticality. As a result, CATA achieves significant improvements over a baseline static scheduler, reaching average improvements of up to 18.4% in execution time and 30.1% in Energy-Delay Product (EDP) on a simulated 32-core system. The cost of reconfiguring hardware by means of a software-only solution rises with the number of cores due to lock contention and reconfiguration overhead. Therefore, novel architectural support is proposed to eliminate these overheads on future manycore systems. This architectural support minimally extends hardware structures already present in current processors, which allows further improvements in performance with negligible overhead. As a consequence, average improvements of up to 20.4% in execution time and 34.0% in EDP are obtained, outperforming state-of-the-art acceleration proposals not aware of task criticality.
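As a conceptual illustration of the decision CATA makes, the sketch below assigns the fastest core configuration to critical tasks and the most efficient one to the rest, with a simple work/frequency time model. The frequency levels and task set are invented for the example and do not reflect the paper's simulated hardware.

```python
# Conceptual sketch of criticality-driven acceleration (a CATA-style decision):
# critical tasks get the fastest core configuration, others the most efficient one.
# Frequency levels, task set and the linear work/frequency time model are illustrative.
FREQS_GHZ = [1.0, 1.5, 2.0]                     # assumed reconfiguration points

def assign_frequency(task):
    return FREQS_GHZ[-1] if task["critical"] else FREQS_GHZ[0]

def estimated_time(task):
    return task["work"] / assign_frequency(task)  # simple work/frequency model

tasks = [
    {"name": "diag_factorization", "work": 6.0, "critical": True},
    {"name": "panel_update",       "work": 4.0, "critical": False},
]
for t in tasks:
    print(f'{t["name"]}: {assign_frequency(t)} GHz, ~{estimated_time(t):.2f} s')
```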

16 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: This work shows how a parallel runtime system can be used to effectively deal with a new kind of performance heterogeneity by compensating the uneven effects of power capping; the novel runtime analysis is compared with an offline approach, demonstrating that it can achieve equal performance at a fraction of the cost.
Abstract: Current large scale systems show increasing power demands, to the point that it has become a huge strain on facilities and budgets. Researchers in academia, labs and industry are focusing on dealing with this "power wall", striving to find a balance between performance and power consumption. Some commodity processors enable power capping, which opens up new opportunities for applications to directly manage their power behavior at user level. However, while power capping ensures a system will never exceed a given power limit, it also leads to a new form of heterogeneity: natural manufacturing variability, which was previously hidden by varying power to achieve homogeneous performance, now results in heterogeneous performance caused by different CPU frequencies, potentially for each core, to enforce the power limit. In this work we show how a parallel runtime system can be used to effectively deal with this new kind of performance heterogeneity by compensating the uneven effects of power capping. In the context of a NUMA node composed of several multi-core sockets, our system is able to optimize the energy and concurrency levels assigned to each socket to maximize performance. Applied transparently within the parallel runtime system, it does not require any programmer interaction like changing the application source code or manually reconfiguring the parallel system. We compare our novel runtime analysis with an offline approach and demonstrate that it can achieve equal performance at a fraction of the cost.

15 citations


Posted Content
TL;DR: An innovative modification to a traditional evaluation methodology is introduced with the goal of adapting it to the problem of evaluating link prediction algorithms when applied to large graphs, by tackling the issue of class imbalance.
Abstract: Link prediction, the problem of identifying missing links among a set of inter-related data entities, is a popular field of research due to its application to graph-like domains. Producing consistent evaluations of the performance of the many link prediction algorithms being proposed can be challenging due to variable graph properties, such as size and density. In this paper we first discuss traditional data mining solutions which are applicable to link prediction evaluation, arguing about their capacity for producing faithful and useful evaluations. We also introduce an innovative modification to a traditional evaluation methodology with the goal of adapting it to the problem of evaluating link prediction algorithms when applied to large graphs, by tackling the problem of class imbalance. We empirically evaluate the proposed methodology and, building on these findings, make a case for its importance on the evaluation of large-scale graph processing.
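To illustrate why class imbalance matters in this setting, the following sketch scores a heavily imbalanced set of candidate links with a weak, synthetic similarity score and contrasts ROC-AUC with average precision; it demonstrates the general problem, not the specific evaluation methodology the paper proposes.

```python
# Sketch of the class-imbalance issue in link prediction evaluation: with very few
# positives (existing links) among many candidate pairs, ROC-AUC can look good while
# precision-recall reveals the weakness. Data and scores are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_pairs, n_links = 100_000, 100                  # heavy imbalance: 0.1% positives
y = np.zeros(n_pairs)
y[:n_links] = 1
scores = rng.normal(0.0, 1.0, n_pairs)           # a weak similarity score ...
scores[:n_links] += 1.0                          # ... slightly higher for true links

print("ROC-AUC           :", round(roc_auc_score(y, scores), 3))
print("Average precision :", round(average_precision_score(y, scores), 4))
```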

12 citations


Proceedings ArticleDOI
01 Sep 2016
TL;DR: This paper proposes a runtime-based selective task replication technique that is automatic and does not require modification/recompilation of OS, compiler or application code, and shows that with App FIT, it can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
Abstract: In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification/recompilation of OS, compiler or application code. Our heuristic, which we call App FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
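A rough sketch of the general idea behind selective replication under a reliability target: given an assumed per-task contribution to the application failure rate (FIT), replicate the largest contributors until the target is met. The greedy selection and the FIT numbers are illustrative assumptions; the App FIT heuristic itself is not reproduced here.

```python
# Conceptual sketch of selective task replication: replicate the tasks that
# contribute most to the application failure rate until a reliability target is met.
# FIT numbers and the greedy selection are illustrative, not the paper's heuristic.
def select_tasks_to_replicate(task_fit, target_fit):
    """task_fit: {task_name: failures-in-time contribution}; returns replicated set."""
    replicated, current = set(), sum(task_fit.values())
    # greedily replicate the highest-FIT task until the target is met
    for name, fit in sorted(task_fit.items(), key=lambda kv: kv[1], reverse=True):
        if current <= target_fit:
            break
        replicated.add(name)
        current -= fit            # assume a replicated task no longer contributes
    return replicated, current

tasks = {"stencil": 40.0, "halo_exchange": 15.0, "reduce": 5.0, "io": 1.0}
chosen, remaining = select_tasks_to_replicate(tasks, target_fit=10.0)
print("replicate:", chosen, "remaining FIT:", remaining)
```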

11 citations


Proceedings ArticleDOI
11 Sep 2016
TL;DR: This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform through the use of a joint hardware/software approach and shows that the NUMA-aware techniques employed at the runtime level are crucial to ensure the added hierarchical layer in the directory coherence protocol does not introduce significant coherence traffic to the system.
Abstract: Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat memory address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform through the use of a joint hardware/software approach. For several benchmarks, we study coherence traffic in detail under the influence of an added hierarchical cache layer in the directory protocol combined with runtime managed NUMA-aware scheduling and data allocation techniques to make most efficient use of the added hardware. The effectiveness of this joint approach is demonstrated by speedups of 1.23× to 2.54× and coherence traffic reductions between 44% and 77% in comparison to NUMA-oblivious scheduling and data allocation. Furthermore, we show that the NUMA-aware techniques we employ at the runtime level are crucial to ensure the added hierarchical layer in the directory coherence protocol does not introduce significant coherence traffic to the system.

11 citations


Proceedings ArticleDOI
23 May 2016
TL;DR: This work proposes a Cyclic Redundancy Check (CRC) based software mechanism for task-parallel HPC applications that reduces the memory vulnerability by 87% on average with up to 32-bit burst and 5-bit arbitrary error correction capability.
Abstract: Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and the number of memory devices in Exascale systems. For memory systems, Error Correcting Codes (ECC) are the most commonly used mechanism. However, state-of-the-art hardware ECCs will not be sufficient in terms of error coverage for future computing systems, and stronger hardware ECCs providing more coverage have prohibitive costs in terms of area, power and latency. Software-based solutions are needed to cooperate with hardware. In this work, we propose a Cyclic Redundancy Check (CRC) based software mechanism for task-parallel HPC applications. Our mechanism incurs only 1.7% performance overhead with hardware acceleration while being highly scalable at large scale. Our mathematical analysis demonstrates the effectiveness of our scheme and its error coverage. Results show that our CRC-based mechanism reduces the memory vulnerability by 87% on average with up to 32-bit burst (consecutive) and 5-bit arbitrary error correction capability.
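A minimal sketch of software CRC protection for task data: a checksum is computed when a block is produced and verified before a consumer task reads it. zlib.crc32 is used as a stand-in; the paper's mechanism additionally provides error correction and relies on hardware CRC acceleration, neither of which is modelled here.

```python
# Minimal sketch of software CRC protection for task data: a checksum is computed
# when a block is produced and verified before a consumer task reads it. zlib.crc32
# is a stand-in; the paper's mechanism also corrects errors and uses hardware
# acceleration, which this detection-only sketch does not model.
import zlib

def protect(block: bytes) -> int:
    return zlib.crc32(block)

def verify(block: bytes, stored_crc: int) -> bool:
    return zlib.crc32(block) == stored_crc

buf = bytearray(b"task input data" * 64)
crc = protect(bytes(buf))

buf[100] ^= 0x04                     # simulate a single-bit memory error
if not verify(bytes(buf), crc):
    print("corruption detected before task execution")
```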

Journal ArticleDOI
TL;DR: On the road to Exascale computing, both performance and power have to be tackled at different levels, from the system level to the processor level, so it is important to have tools to simultaneously analyze both performance and energy efficiency at processor level.
Abstract: On the road to Exascale computing, both performance and power have to be tackled at different levels, from the system level to the processor level. The processor itself is mainly responsible for the serial node performance and also for most of the energy consumed by the system. Thus, it is important to have tools to simultaneously analyze both performance and energy efficiency at processor level.

Proceedings ArticleDOI
Harald Servat, Germán Llort, Juan Gonzalez, Judit Giménez, Jesús Labarta
04 Apr 2016
TL;DR: This paper presents a novel portable approach to associate performance issues with their source code counterpart using an algorithm inspired by multi-sequence alignment techniques; the results are easily mapped to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code.
Abstract: The correlation of performance bottlenecks and their associated source code has become a cornerstone of performance analysis. It allows understanding why the efficiency of an application falls behind the computer's peak performance, ultimately enabling optimizations of the code. To this end, performance analysis tools collect the processor call-stack and then combine this information with measurements to allow the analyst to comprehend the application behavior. Some tools modify the call-stack during run-time to diminish the collection expense, but at the cost of non-portable solutions. In this paper, we present a novel portable approach to associate performance issues with their source code counterpart. To address it, we capture a reduced segment of the call-stack (up to three levels) and then process the segments using an algorithm inspired by multi-sequence alignment techniques. The results of our approach are easily mapped to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code. To demonstrate the usefulness of our approach, we have applied the algorithm to several in-production applications, seen for the first time, to characterize them in detail and optimize them through small modifications guided by the analyses.
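A small sketch of the kind of processing described: each sample is reduced to a three-level call-stack segment and two streams of such segments are aligned so that matching code regions line up. Python's difflib pairwise matcher is used purely for illustration; the paper's multi-sequence alignment algorithm is not reproduced.

```python
# Small sketch of aligning streams of reduced (three-level) call-stack segments so
# that matching code regions line up across two sampled processes. difflib's
# pairwise matcher is used for illustration only; the paper's multi-sequence
# alignment algorithm is not reproduced here.
from difflib import SequenceMatcher

def segment(frames, depth=3):
    return tuple(frames[:depth])            # keep only the top 'depth' frames

rank0 = [segment(s) for s in [["main", "solve", "spmv"], ["main", "solve", "dot"],
                              ["main", "io", "write"]]]
rank1 = [segment(s) for s in [["main", "solve", "spmv"], ["main", "solve", "axpy"],
                              ["main", "solve", "dot"], ["main", "io", "write"]]]

matcher = SequenceMatcher(a=rank0, b=rank1, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, rank0[i1:i2], rank1[j1:j2])
```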

Book ChapterDOI
05 Oct 2016
TL;DR: Heterogeneous systems are an important trend in the future of supercomputers, yet they can be hard to program and developers still lack powerful tools to gain understanding about how well their accelerated codes perform and how to improve them.
Abstract: Heterogeneous systems are an important trend in the future of supercomputers, yet they can be hard to program and developers still lack powerful tools to gain understanding about how well their accelerated codes perform and how to improve them.

Book ChapterDOI
05 Oct 2016
TL;DR: This paper proposes an extension to the OpenMP 4.5 directive-based programming model to support the specification and execution of multiple instances of task regions on different devices, introducing a new clause that conveys useful insight to guide the scheduler while keeping a clean, abstract and machine-independent programmer interface.
Abstract: The use of GPU accelerators is becoming common in HPC platforms due to their effective performance and energy efficiency. In addition, new generations of multicore processors are being designed with wider vector units and/or larger hardware thread counts, also contributing to the peak performance of the whole system. Although current directive-based paradigms, such as OpenMP or OpenACC, support both accelerators and multicore-based hosts, they do not provide an effective and efficient way to concurrently use them, usually resulting in accelerated programs in which the potential computational performance of the host is not exploited. In this paper we propose an extension to the OpenMP 4.5 directive-based programming model to support the specification and execution of multiple instances of task regions on different devices (i.e. accelerators in conjunction with the vector and heavily multithreaded capabilities in multicore processors). The compiler is responsible for the generation of device-specific code for each device kind, delegating to the runtime system the dynamic scheduling of the tasks to the available devices. The newly proposed clause conveys useful insight to guide the scheduler while keeping a clean, abstract and machine-independent programmer interface. The potential of the proposal is analyzed in a prototype implementation in the OmpSs compiler and runtime infrastructure. Performance evaluation is done using three kernels (N-Body, tiled matrix multiply and Stream) on different GPU-capable systems based on ARM, Intel x86 and IBM Power8. From the evaluation we observe speed-ups in the 8–20% range compared to versions in which only the GPU is used, reaching 96% of the additional peak performance thanks to the reduction of data transfers and the benefits introduced by the OmpSs NUMA-aware scheduler.

Posted Content
TL;DR: This work explores the performance of similarity-based algorithms for the particular problem of hyperlink prediction on large webgraphs and proposes a novel method which assumes the existence of hierarchical properties; the new approach is evaluated on several webgraphs and its performance compared with that of the current best similarity-based algorithms.
Abstract: The hyperlink prediction task, that of proposing new links between webpages, can be used to improve search engines, expand the visibility of web pages, and increase the connectivity and navigability of the web. Hyperlink prediction is typically performed on webgraphs composed of thousands or millions of vertices, where on average each webpage contains less than fifty links. Algorithms processing graphs so large and sparse need to be both scalable and precise, a challenging combination. Similarity-based algorithms are among the most scalable solutions within the link prediction field, due to their parallel nature and computational simplicity. These algorithms independently explore the nearby topological features of every missing link from the graph in order to determine its likelihood. Unfortunately, the precision of similarity-based algorithms is limited, which has prevented their broad application so far. In this work we explore the performance of similarity-based algorithms for the particular problem of hyperlink prediction on large webgraphs, and propose a novel method which assumes the existence of hierarchical properties. We evaluate this new approach on several webgraphs and compare its performance with that of the current best similarity-based algorithms. Its remarkable performance leads us to argue for the applicability of the proposal, identifying several use cases of hyperlink prediction. We also describe the approach we took for the computation of large-scale graphs from the perspective of high-performance computing, providing details on the implementation and parallelization of the code.
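For context, the sketch below computes two classic similarity-based scores (common neighbours and Adamic-Adar) on a toy adjacency map; these are the kind of scalable per-pair scores the paper builds on, while its hierarchy-based method is not reproduced here.

```python
# Two classic similarity-based link prediction scores (common neighbours and
# Adamic-Adar) on a tiny undirected toy graph. These illustrate the per-pair,
# easily parallelisable scores discussed above; the paper's hierarchical method
# is not reproduced here.
import math

graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"a", "c"},
    "e": {"c"},
}

def common_neighbours(u, v):
    return len(graph[u] & graph[v])

def adamic_adar(u, v):
    # weight shared neighbours inversely by the log of their degree
    return sum(1.0 / math.log(len(graph[z])) for z in graph[u] & graph[v]
               if len(graph[z]) > 1)

for pair in [("b", "d"), ("a", "e")]:
    print(pair, "CN =", common_neighbours(*pair), "AA =", round(adamic_adar(*pair), 3))
```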

Book ChapterDOI
05 Oct 2016
TL;DR: This work proposes a solution framework that generalizes the concept of privatization to support a variety of techniques, implements an inspector-executor to provide memory access analytics to the runtime for automatic tuning and shows what language extensions are needed.
Abstract: Irregular array-type reductions represent a recurring algorithmic pattern in many scientific applications. Their scalable execution on modern systems is not trivial as their irregular memory access pattern prohibits an efficient use of the memory subsystem and costly techniques are needed to eliminate data races. Taking a closer look at algorithms, memory access patterns and support techniques reveals that a one-size-fits-all solution does not exist and approaches are needed that can adapt to individual properties while maintaining programming transparency. In this work we propose a solution framework that generalizes the concept of privatization to support a variety of techniques, implements an inspector-executor to provide memory access analytics to the runtime for automatic tuning and shows what language extensions are needed. A reference implementation in OmpSs, a task-parallel programming model, shows programmability and scalability of this solution.
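The privatization and inspector-executor ideas can be sketched sequentially: the inspector records which target entries each chunk of updates touches, the executor accumulates each chunk into a private copy, and the partials are merged. This numpy sketch only illustrates the data flow; the paper's framework does this inside the OmpSs task-parallel runtime and chooses among several techniques, none of which is modelled here.

```python
# Sketch of privatization + inspector/executor for an irregular array reduction,
# written sequentially in numpy purely to illustrate the data flow. The paper's
# framework runs chunks as concurrent tasks in OmpSs and auto-tunes the technique.
import numpy as np

def irregular_reduction(indices, values, size, n_chunks=4):
    chunks = np.array_split(np.arange(len(indices)), n_chunks)

    # "inspector": record which target entries each chunk touches (access analytics)
    touched = [np.unique(indices[c]) for c in chunks]

    # "executor": each chunk accumulates into a private copy, partials are merged
    result = np.zeros(size)
    for c, t in zip(chunks, touched):
        private = np.zeros(size)                       # privatized target array
        np.add.at(private, indices[c], values[c])      # race-free local accumulation
        result[t] += private[t]                        # merge only the touched entries
    return result

rng = np.random.default_rng(1)
idx = rng.integers(0, 10, size=100)
val = rng.random(100)
print(np.allclose(irregular_reduction(idx, val, 10),
                  np.bincount(idx, weights=val, minlength=10)))
```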

Proceedings ArticleDOI
11 Sep 2016
TL;DR: An evaluation of emerging parallel applications on an asymmetric multi-core architecture, using the PARSEC benchmark suite and a processor that implements the ARM big.LITTLE architecture, concludes that these applications are not mature enough to run on such systems, as they suffer from load imbalance.
Abstract: Energy efficiency has become the main challenge for high performance computing (HPC). The use of mobile asymmetric multi-core architectures to build future multi-core systems is an approach towards energy savings while keeping high performance. However, it is not known yet whether such systems are ready to handle parallel applications. This paper fills this gap by evaluating emerging parallel applications on an asymmetric multi-core. We make use of the PARSEC benchmark suite and a processor that implements the ARM big.LITTLE architecture. We conclude that these applications are not mature enough to run on such systems, as they suffer from load imbalance. Furthermore, we explore the behaviour of dynamic scheduling solutions at either the Operating System (OS) or the runtime level. Comparing these approaches shows us that the most efficient scheduling takes place at the runtime level, steering future research towards such solutions.

Proceedings ArticleDOI
11 Sep 2016
TL;DR: This paper proposes and evaluates the basic idea behind a user-directed code transformation technique, named collective dynamic parallelism, that targets the effective exploitation of nested parallelism in modern GPUs, and shows that for sparse matrix vector multiplication, CollectiveDP outperforms well-optimized libraries, making GPUs useful when matrices are highly irregular.
Abstract: Early programs for GPU (Graphics Processing Units) acceleration were based on a flat, bulk parallel programming model, in which programs had to perform a sequence of kernel launches from the host CPU. In the latest releases of these devices, dynamic (or nested) parallelism is supported, making it possible to launch kernels from threads running on the device, without host intervention. Unfortunately, the overhead of launching kernels from the device is higher compared to launching from the host CPU, making the exploitation of dynamic parallelism unprofitable. This paper proposes and evaluates the basic idea behind a user-directed code transformation technique, named collective dynamic parallelism, that targets the effective exploitation of nested parallelism in modern GPUs. The technique dynamically packs dynamic parallelism kernel invocations and postpones their execution until a batch of them is available. We show that for sparse matrix vector multiplication, CollectiveDP outperforms well-optimized libraries, making GPUs useful when matrices are highly irregular.
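The heart of the transformation can be sketched as a buffer that packs pending nested work items and issues them as one collective launch once enough have accumulated. The batched "kernel" below is a plain Python placeholder; the actual technique operates on GPU kernels launched from the device.

```python
# Conceptual sketch of the "collective dynamic parallelism" idea: instead of
# launching one child kernel per work item, items are packed into a buffer and
# processed in a single batched launch once enough have accumulated. The batched
# "kernel" is a plain Python placeholder, not a device-side GPU launch.
class CollectiveLauncher:
    def __init__(self, batch_size, kernel):
        self.batch_size, self.kernel, self.pending = batch_size, kernel, []

    def submit(self, work_item):
        self.pending.append(work_item)
        if len(self.pending) >= self.batch_size:   # enough work: launch collectively
            self.flush()

    def flush(self):
        if self.pending:
            self.kernel(self.pending)              # one launch for the whole batch
            self.pending = []

def batched_kernel(rows):
    print(f"processing {len(rows)} rows in one launch")

launcher = CollectiveLauncher(batch_size=4, kernel=batched_kernel)
for row in range(10):                              # e.g. rows of a sparse matrix
    launcher.submit(row)
launcher.flush()                                   # drain the remaining tail
```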

Book ChapterDOI
01 Jan 2016
TL;DR: This work extracts features from CNN layers, building vector representations from CNN activations, and defines a taxonomy of knowledge as perceived by the CNN, indicating that, while top layers provide the most representative space, low layers also define descriptive dimensions.
Abstract: Convolutional Neural Networks (CNN) are the most popular of deep network models due to their applicability and success in image processing. Although plenty of effort has been made in designing and training better discriminative CNNs, little is yet known about the internal features these models learn. Questions like, what specific knowledge is coded within CNN layers, and how can it be used for other purposes besides discrimination, remain to be answered. To advance in the resolution of these questions, in this work we extract features from CNN layers, building vector representations from CNN activations. The resultant vector embedding is used to represent first images and then known image classes. On those representations we perform an unsupervised clustering process, with the goal of studying the hidden semantics captured in the embedding space. Several abstract entities untaught to the network emerge in this process, effectively defining a taxonomy of knowledge as perceived by the CNN. We evaluate and interpret these sets using WordNet, while studying the different behaviours exhibited by the layers of a CNN model according to their depth. Our results indicate that, while top (i.e., deeper) layers provide the most representative space, low layers also define descriptive dimensions.
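A compact sketch of the pipeline described: activations from one layer of a pretrained CNN are taken as vector embeddings of images and clustered without supervision. The model (ResNet-18), the tapped layer and the number of clusters are illustrative choices, not those used in the chapter, and random tensors stand in for a real image batch.

```python
# Compact sketch of the pipeline described above: activations from one layer of a
# pretrained CNN serve as vector embeddings of images and are clustered without
# supervision. Model, layer and cluster count are illustrative only; random tensors
# stand in for a real image batch.
import torch
import torchvision.models as models
from sklearn.cluster import KMeans

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

activations = []
def hook(_module, _inputs, output):               # capture the layer's output
    activations.append(torch.flatten(output, 1))

model.avgpool.register_forward_hook(hook)

images = torch.randn(16, 3, 224, 224)             # stand-in for a real image batch
with torch.no_grad():
    model(images)

embeddings = torch.cat(activations).numpy()       # one vector per image
labels = KMeans(n_clusters=4, n_init=10).fit_predict(embeddings)
print(labels)
```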