Showing papers by "Jesús Labarta" published in 2016


Proceedings ArticleDOI
13 Nov 2016
TL;DR: The Mont-Blanc prototype is presented: the first HPC system built with commodity SoCs, memories, and network interface cards from the embedded and mobile domain, combined with off-the-shelf HPC networking, storage, cooling, and integration solutions.
Abstract: High-performance computing (HPC) is recognized as one of the pillars for further progress in science, industry, medicine, and education. Current HPC systems are being developed to overcome emerging architectural challenges in order to reach Exascale level of performance, projected for the year 2020. The much larger embedded and mobile market allows for rapid development of intellectual property (IP) blocks and provides more flexibility in designing an application-specific system-on-chip (SoC), in turn providing the possibility in balancing performance, energy-efficiency, and cost. In the Mont-Blanc project, we advocate for HPC systems being built from such commodity IP blocks, currently used in embedded and mobile SoCs. As a first demonstrator of such an approach, we present the Mont-Blanc prototype; the first HPC system built with commodity SoCs, memories, and network interface cards (NICs) from the embedded and mobile domain, and off-the-shelf HPC networking, storage, cooling, and integration solutions. We present the system's architecture and evaluate both performance and energy efficiency. Further, we compare the system's abilities against a production level supercomputer. At the end, we discuss parallel scalability and estimate the maximum scalability point of this approach across a set of applications.

57 citations


Journal ArticleDOI
TL;DR: A dynamic load balancing library is used on top of the OpenMP pragmas to continuously exploit all the resources available at the node level, increasing the load balance and the efficiency of the MPI+OpenMP parallelisation.
Abstract: This work presents a parallel numerical strategy to transport Lagrangian particles in a fluid using a dynamic load balance strategy. Both fluid and particle solvers are parallel, with two levels of parallelism. The first level is based on a substructuring technique and uses the message passing interface (MPI) as the communication library; the second level consists of OpenMP pragmas for loop parallelisation at the node level. When dealing with transient flows, there exist two main alternatives to address the coupling of these solvers. On the one hand, a single-code approach consists in solving the particle equations once the fluid solution has been obtained at the end of a time step, using the same instance of the same code. On the other hand, a multi-code approach enables one to overlap the transport of the particles with the next time-step solution of the fluid equations, and thus obtain asynchronism. In this case, different codes or two instances of the same code can be used. Both approaches will be presented. In addition, a dynamic load balancing library is used on top of the OpenMP pragmas in order to continuously exploit all the resources available at the node level, thus increasing the load balance and the efficiency of the MPI+OpenMP parallelisation.
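A minimal sketch of the multi-code (asynchronous) coupling idea described above: the particle transport for step n is overlapped with the fluid solve of step n+1. The fluid_step and transport_particles functions are hypothetical placeholders, not the solver interfaces used in the paper, and a thread pool stands in for the second code instance.

```python
# Minimal sketch of the multi-code (asynchronous) coupling idea: the particle
# transport for step n runs while the fluid solver already advances to step n+1.
# fluid_step() and transport_particles() are hypothetical placeholders, not the
# actual solver interfaces used in the paper.
from concurrent.futures import ThreadPoolExecutor

def fluid_step(state, dt):
    # advance the fluid solution by one time step (placeholder)
    return state + dt

def transport_particles(particles, fluid_state, dt):
    # move Lagrangian particles through the given fluid field (placeholder)
    return [p + dt * fluid_state for p in particles]

def run_coupled(n_steps, dt=0.1):
    fluid, particles = 0.0, [0.0, 1.0, 2.0]
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(n_steps):
            fluid_prev = fluid
            # launch particle transport for the step just computed ...
            pending = pool.submit(transport_particles, particles, fluid_prev, dt)
            # ... while the fluid solver advances to the next time step
            fluid = fluid_step(fluid, dt)
            particles = pending.result()   # synchronise before reusing particles
    return fluid, particles

if __name__ == "__main__":
    print(run_coupled(5))
```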

25 citations


Proceedings ArticleDOI
13 Nov 2016
TL;DR: This paper uses MUSA to simulate up to 16,384 cores and successfully identify scalability bottlenecks due to different factors, e.g. memory contention or load imbalance, showing how MUSA can help system designers to assess the usefulness of future technologies in next-generation HPC machines.
Abstract: The complexity of High Performance Computing (HPC) systems is increasing in the number of components and their heterogeneity. Interactions between software and hardware involve many different aspects which are typically not transparent to scientific programmers and system architects. Therefore, predicting the behavior of current scientific applications on future HPC infrastructures is a challenging task. In this paper we present MUSA, an end-to-end methodology that employs a multi-level simulation infrastructure. By combining different levels of abstraction, MUSA is able to model the communication network, microarchitectural details and system software interactions, providing different trade-offs in terms of simulation cost and accuracy. We compare detailed MUSA simulations with native executions of up to 2,048 cores and find relative errors that are within 10% in the common case. In addition, we use MUSA to simulate up to 16,384 cores and successfully identify scalability bottlenecks due to different factors, e.g. memory contention or load imbalance. We also compare different system configurations, showing how MUSA can help system designers to assess the usefulness of future technologies in next-generation HPC machines.

24 citations


Proceedings Article
01 May 2016
TL;DR: It is shown how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets.
Abstract: This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.

20 citations


Proceedings ArticleDOI
16 May 2016
TL;DR: A low-memory-overhead SDC detector leveraging epsilon-insensitive support vector machine regression to detect SDCs occurring in HPC applications that can be characterized by an impact error bound; it exhibits the best tradeoff between detection ability and overhead among the techniques compared.
Abstract: As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs) or silent errors are one of the major sources that corrupt the execution results of HPC applications without being detected. In this work, we explore a low-memory-overhead SDC detector, by leveraging epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications that can be characterized by an impact error bound. The key contributions are threefold. (1) Our design takes spatial features (i.e., neighbouring data values for each data point in a snapshot) into training data, such that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that our detector achieves a detection sensitivity (i.e., recall) of up to 99% while suffering a false positive rate of less than 1% in most cases. Our detector incurs low performance overhead, 5% on average, for all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff considering the detection ability and overheads.
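The core of the detection scheme lends itself to a small sketch: train an epsilon-insensitive support vector regressor to predict each data point from its spatial neighbours and flag points whose residual exceeds the impact error bound. The synthetic snapshot, the chosen error bound and the choice to train and test on the same snapshot are illustrative simplifications, not the paper's exact procedure.

```python
# Illustrative sketch (not the paper's implementation) of spatial-neighbour SVR
# based SDC detection: each interior point of a 1-D snapshot is predicted from its
# two neighbours; a residual above the impact error bound marks a possible SDC.
import numpy as np
from sklearn.svm import SVR

def build_features(snapshot):
    # spatial features: the left and right neighbour of every interior point
    x = np.column_stack([snapshot[:-2], snapshot[2:]])
    y = snapshot[1:-1]
    return x, y

def detect_sdc(snapshot, error_bound=1e-2, epsilon=1e-3):
    x, y = build_features(snapshot)
    model = SVR(kernel="rbf", epsilon=epsilon).fit(x, y)   # epsilon-insensitive regression
    residual = np.abs(model.predict(x) - y)
    return np.where(residual > error_bound)[0] + 1          # indices of suspect points

if __name__ == "__main__":
    clean = np.sin(np.linspace(0, 3 * np.pi, 200))
    corrupted = clean.copy()
    corrupted[120] += 0.5                                    # inject a silent error
    print("suspect indices:", detect_sdc(corrupted, error_bound=0.05))
```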

18 citations


Proceedings ArticleDOI
23 May 2016
TL;DR: A Criticality Aware Task Acceleration (CATA) mechanism is proposed that dynamically adapts the computational power of a task depending on its criticality, outperforming state-of-the-art acceleration proposals not aware of task criticality.
Abstract: Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigurations, we propose a Criticality Aware Task Acceleration (CATA) mechanism that dynamically adapts the computational power of a task depending on its criticality. As a result, CATA achieves significant improvements over a baseline static scheduler, reaching average improvements of up to 18.4% in execution time and 30.1% in Energy-Delay Product (EDP) on a simulated 32-core system. The cost of reconfiguring hardware by means of a software-only solution rises with the number of cores due to lock contention and reconfiguration overhead. Therefore, novel architectural support is proposed to eliminate these overheads on future manycore systems. This architectural support minimally extends hardware structures already present in current processors, which allows further improvements in performance with negligible overhead. As a consequence, average improvements of up to 20.4% in execution time and 34.0% in EDP are obtained, outperforming state-of-the-art acceleration proposals not aware of task criticality.
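As a conceptual illustration of the decision CATA makes, the sketch below assigns the fastest core configuration to critical tasks and the most efficient one to the rest, with a simple work/frequency time model. The frequency levels and task set are invented for the example and do not reflect the paper's simulated hardware.

```python
# Conceptual sketch of criticality-driven acceleration (a CATA-style decision):
# critical tasks get the fastest core configuration, others the most efficient one.
# Frequency levels, task set and the linear work/frequency time model are illustrative.
FREQS_GHZ = [1.0, 1.5, 2.0]                     # assumed reconfiguration points

def assign_frequency(task):
    return FREQS_GHZ[-1] if task["critical"] else FREQS_GHZ[0]

def estimated_time(task):
    return task["work"] / assign_frequency(task)  # simple work/frequency model

tasks = [
    {"name": "diag_factorization", "work": 6.0, "critical": True},
    {"name": "panel_update",       "work": 4.0, "critical": False},
]
for t in tasks:
    print(f'{t["name"]}: {assign_frequency(t)} GHz, ~{estimated_time(t):.2f} s')
```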

16 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: This work shows how a parallel runtime system can be used to effectively deal with a new kind of performance heterogeneity by compensating the uneven effects of power capping; the novel runtime analysis is compared with an offline approach, demonstrating that it can achieve equal performance at a fraction of the cost.
Abstract: Current large scale systems show increasing power demands, to the point that it has become a huge strain on facilities and budgets. Researchers in academia, labs and industry are focusing on dealing with this "power wall", striving to find a balance between performance and power consumption. Some commodity processors enable power capping, which opens up new opportunities for applications to directly manage their power behavior at user level. However, while power capping ensures a system will never exceed a given power limit, it also leads to a new form of heterogeneity: natural manufacturing variability, which was previously hidden by varying power to achieve homogeneous performance, now results in heterogeneous performance caused by different CPU frequencies, potentially for each core, to enforce the power limit. In this work we show how a parallel runtime system can be used to effectively deal with this new kind of performance heterogeneity by compensating the uneven effects of power capping. In the context of a NUMA node composed of several multi-core sockets, our system is able to optimize the energy and concurrency levels assigned to each socket to maximize performance. Applied transparently within the parallel runtime system, it does not require any programmer interaction like changing the application source code or manually reconfiguring the parallel system. We compare our novel runtime analysis with an offline approach and demonstrate that it can achieve equal performance at a fraction of the cost.

15 citations


Posted Content
TL;DR: An innovative modification to a traditional evaluation methodology is introduced with the goal of adapting it to the problem of evaluating link prediction algorithms when applied to large graphs, by tackling the issue of class imbalance.
Abstract: Link prediction, the problem of identifying missing links among a set of inter-related data entities, is a popular field of research due to its application to graph-like domains. Producing consistent evaluations of the performance of the many link prediction algorithms being proposed can be challenging due to variable graph properties, such as size and density. In this paper we first discuss traditional data mining solutions which are applicable to link prediction evaluation, arguing about their capacity for producing faithful and useful evaluations. We also introduce an innovative modification to a traditional evaluation methodology with the goal of adapting it to the problem of evaluating link prediction algorithms when applied to large graphs, by tackling the problem of class imbalance. We empirically evaluate the proposed methodology and, building on these findings, make a case for its importance on the evaluation of large-scale graph processing.
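To illustrate why class imbalance matters in this setting, the following sketch scores a heavily imbalanced set of candidate links with a weak, synthetic similarity score and contrasts ROC-AUC with average precision; it demonstrates the general problem, not the specific evaluation methodology the paper proposes.

```python
# Sketch of the class-imbalance issue in link prediction evaluation: with very few
# positives (existing links) among many candidate pairs, ROC-AUC can look good while
# precision-recall reveals the weakness. Data and scores are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_pairs, n_links = 100_000, 100                  # heavy imbalance: 0.1% positives
y = np.zeros(n_pairs)
y[:n_links] = 1
scores = rng.normal(0.0, 1.0, n_pairs)           # a weak similarity score ...
scores[:n_links] += 1.0                          # ... slightly higher for true links

print("ROC-AUC           :", round(roc_auc_score(y, scores), 3))
print("Average precision :", round(average_precision_score(y, scores), 4))
```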

12 citations


Proceedings ArticleDOI
01 Sep 2016
TL;DR: This paper proposes a runtime-based selective task replication technique that is automatic and does not require modification/recompilation of OS, compiler or application code, and shows that with App FIT, it can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
Abstract: In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification/recompilation of OS, compiler or application code. Our heuristic, which we call App FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
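A rough sketch of the general idea behind selective replication under a reliability target: given an assumed per-task contribution to the application failure rate (FIT), replicate the largest contributors until the target is met. The greedy selection and the FIT numbers are illustrative assumptions; the App FIT heuristic itself is not reproduced here.

```python
# Conceptual sketch of selective task replication: replicate the tasks that
# contribute most to the application failure rate until a reliability target is met.
# FIT numbers and the greedy selection are illustrative, not the paper's heuristic.
def select_tasks_to_replicate(task_fit, target_fit):
    """task_fit: {task_name: failures-in-time contribution}; returns replicated set."""
    replicated, current = set(), sum(task_fit.values())
    # greedily replicate the highest-FIT task until the target is met
    for name, fit in sorted(task_fit.items(), key=lambda kv: kv[1], reverse=True):
        if current <= target_fit:
            break
        replicated.add(name)
        current -= fit            # assume a replicated task no longer contributes
    return replicated, current

tasks = {"stencil": 40.0, "halo_exchange": 15.0, "reduce": 5.0, "io": 1.0}
chosen, remaining = select_tasks_to_replicate(tasks, target_fit=10.0)
print("replicate:", chosen, "remaining FIT:", remaining)
```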

11 citations


Proceedings ArticleDOI
11 Sep 2016
TL;DR: This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform through the use of a joint hardware/software approach and shows that the NUMA-aware techniques employed at the runtime level are crucial to ensure the added hierarchical layer in the directory coherence protocol does not introduce significant coherence traffic to the system.
Abstract: Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat memory address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform through the use of a joint hardware/software approach. For several benchmarks, we study coherence traffic in detail under the influence of an added hierarchical cache layer in the directory protocol combined with runtime managed NUMA-aware scheduling and data allocation techniques to make most efficient use of the added hardware. The effectiveness of this joint approach is demonstrated by speedups of 1.23× to 2.54× and coherence traffic reductions between 44% and 77% in comparison to NUMA-oblivious scheduling and data allocation. Furthermore, we show that the NUMA-aware techniques we employ at the runtime level are crucial to ensure the added hierarchical layer in the directory coherence protocol does not introduce significant coherence traffic to the system.

11 citations


Proceedings ArticleDOI
23 May 2016
TL;DR: This work proposes a Cyclic Redundancy Check (CRC) based software mechanism for task-parallel HPC applications that reduces the memory vulnerability by 87% on average with up to 32-bit burst and 5-bit arbitrary error correction capability.
Abstract: Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and the number of memory devices in Exascale systems. For memory systems, Error Correcting Codes (ECC) are the most commonly used mechanism. However, state-of-the-art hardware ECCs will not be sufficient in terms of error coverage for future computing systems, and stronger hardware ECCs providing more coverage have prohibitive costs in terms of area, power and latency. Software-based solutions are needed to cooperate with hardware. In this work, we propose a Cyclic Redundancy Check (CRC) based software mechanism for task-parallel HPC applications. Our mechanism incurs only 1.7% performance overhead with hardware acceleration while being highly scalable at large scale. Our mathematical analysis demonstrates the effectiveness of our scheme and its error coverage. Results show that our CRC-based mechanism reduces the memory vulnerability by 87% on average with up to 32-bit burst (consecutive) and 5-bit arbitrary error correction capability.
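A minimal sketch of software CRC protection for task data: a checksum is computed when a block is produced and verified before a consumer task reads it. zlib.crc32 is used as a stand-in; the paper's mechanism additionally provides error correction and relies on hardware CRC acceleration, neither of which is modelled here.

```python
# Minimal sketch of software CRC protection for task data: a checksum is computed
# when a block is produced and verified before a consumer task reads it. zlib.crc32
# is a stand-in; the paper's mechanism also corrects errors and uses hardware
# acceleration, which this detection-only sketch does not model.
import zlib

def protect(block: bytes) -> int:
    return zlib.crc32(block)

def verify(block: bytes, stored_crc: int) -> bool:
    return zlib.crc32(block) == stored_crc

buf = bytearray(b"task input data" * 64)
crc = protect(bytes(buf))

buf[100] ^= 0x04                     # simulate a single-bit memory error
if not verify(bytes(buf), crc):
    print("corruption detected before task execution")
```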

Journal ArticleDOI
TL;DR: On the road to Exascale computing, both performance and power have to be tackled at different levels, from the system level to the processor level, so it is important to have tools to simultaneously analyze both performance and energy efficiency at processor level.
Abstract: On the road to Exascale computing, both performance and power have to be tackled at different levels, from the system level to the processor level. The processor itself is mainly responsible for the serial node performance and also for most of the energy consumed by the system. Thus, it is important to have tools to simultaneously analyze both performance and energy efficiency at processor level.

Proceedings ArticleDOI
Harald Servat, Germán Llort, Juan Gonzalez, Judit Giménez, Jesús Labarta
04 Apr 2016
TL;DR: This paper presents a novel portable approach to associate performance issues with their source code counterpart using an algorithm inspired by multi-sequence alignment techniques; the results are easily mapped to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code.
Abstract: The correlation of performance bottlenecks and their associated source code has become a cornerstone of performance analysis. It allows understanding why the efficiency of an application falls behind the computer's peak performance, ultimately enabling optimizations of the code. To this end, performance analysis tools collect the processor call-stack and then combine this information with measurements to allow the analyst to comprehend the application behavior. Some tools modify the call-stack during run-time to diminish the collection expense, but at the cost of non-portable solutions. In this paper, we present a novel portable approach to associate performance issues with their source code counterpart. To address it, we capture a reduced segment of the call-stack (up to three levels) and then process the segments using an algorithm inspired by multi-sequence alignment techniques. The results of our approach are easily mapped to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code. To demonstrate the usefulness of our approach, we have applied the algorithm to several in-production applications, seen for the first time, to characterize them in detail and optimize them through small modifications guided by the analyses.
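A small sketch of the kind of processing described: each sample is reduced to a three-level call-stack segment and two streams of such segments are aligned so that matching code regions line up. Python's difflib pairwise matcher is used purely for illustration; the paper's multi-sequence alignment algorithm is not reproduced.

```python
# Small sketch of aligning streams of reduced (three-level) call-stack segments so
# that matching code regions line up across two sampled processes. difflib's
# pairwise matcher is used for illustration only; the paper's multi-sequence
# alignment algorithm is not reproduced here.
from difflib import SequenceMatcher

def segment(frames, depth=3):
    return tuple(frames[:depth])            # keep only the top 'depth' frames

rank0 = [segment(s) for s in [["main", "solve", "spmv"], ["main", "solve", "dot"],
                              ["main", "io", "write"]]]
rank1 = [segment(s) for s in [["main", "solve", "spmv"], ["main", "solve", "axpy"],
                              ["main", "solve", "dot"], ["main", "io", "write"]]]

matcher = SequenceMatcher(a=rank0, b=rank1, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, rank0[i1:i2], rank1[j1:j2])
```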

Book ChapterDOI
05 Oct 2016
TL;DR: Heterogeneous systems are an important trend in the future of supercomputers, yet they can be hard to program and developers still lack powerful tools to gain understanding about how well their accelerated codes perform and how to improve them.
Abstract: Heterogeneous systems are an important trend in the future of supercomputers, yet they can be hard to program and developers still lack powerful tools to gain understanding about how well their accelerated codes perform and how to improve them.

Book ChapterDOI
05 Oct 2016
TL;DR: This paper proposes an extension to the OpenMP 4.5 directive-based programming model to support the specification and execution of multiple instances of task regions on different devices, introducing a new clause that conveys useful insight to guide the scheduler while keeping a clean, abstract and machine-independent programmer interface.
Abstract: The use of GPU accelerators is becoming common in HPC platforms due to their effective performance and energy efficiency. In addition, new generations of multicore processors are being designed with wider vector units and/or larger hardware thread counts, also contributing to the peak performance of the whole system. Although current directive-based paradigms, such as OpenMP or OpenACC, support both accelerators and multicore-based hosts, they do not provide an effective and efficient way to concurrently use them, usually resulting in accelerated programs in which the potential computational performance of the host is not exploited. In this paper we propose an extension to the OpenMP 4.5 directive-based programming model to support the specification and execution of multiple instances of task regions on different devices (i.e. accelerators in conjunction with the vector and heavily multithreaded capabilities in multicore processors). The compiler is responsible for the generation of device-specific code for each device kind, delegating to the runtime system the dynamic scheduling of the tasks to the available devices. The newly proposed clause conveys useful insight to guide the scheduler while keeping a clean, abstract and machine-independent programmer interface. The potential of the proposal is analyzed in a prototype implementation in the OmpSs compiler and runtime infrastructure. Performance evaluation is done using three kernels (N-Body, tiled matrix multiply and Stream) on different GPU-capable systems based on ARM, Intel x86 and IBM Power8. From the evaluation we observe speed-ups in the 8–20% range compared to versions in which only the GPU is used, reaching 96% of the additional peak performance thanks to the reduction of data transfers and the benefits introduced by the OmpSs NUMA-aware scheduler.

Posted Content
TL;DR: This work explores the performance of similarity-based algorithms for the particular problem of hyperlink prediction on large webgraphs and proposes a novel method which assumes the existence of hierarchical properties; the new approach is evaluated on several webgraphs and its performance compared with that of the current best similarity-based algorithms.
Abstract: The hyperlink prediction task, that of proposing new links between webpages, can be used to improve search engines, expand the visibility of web pages, and increase the connectivity and navigability of the web. Hyperlink prediction is typically performed on webgraphs composed of thousands or millions of vertices, where on average each webpage contains less than fifty links. Algorithms processing graphs so large and sparse need to be both scalable and precise, a challenging combination. Similarity-based algorithms are among the most scalable solutions within the link prediction field, due to their parallel nature and computational simplicity. These algorithms independently explore the nearby topological features of every missing link from the graph in order to determine its likelihood. Unfortunately, the precision of similarity-based algorithms is limited, which has prevented their broad application so far. In this work we explore the performance of similarity-based algorithms for the particular problem of hyperlink prediction on large webgraphs, and propose a novel method which assumes the existence of hierarchical properties. We evaluate this new approach on several webgraphs and compare its performance with that of the current best similarity-based algorithms. Its remarkable performance leads us to argue for the applicability of the proposal, identifying several use cases of hyperlink prediction. We also describe the approach we took for the computation of large-scale graphs from the perspective of high-performance computing, providing details on the implementation and parallelization of the code.
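For context, the sketch below computes two classic similarity-based scores (common neighbours and Adamic-Adar) on a toy adjacency map; these are the kind of scalable per-pair scores the paper builds on, while its hierarchy-based method is not reproduced here.

```python
# Two classic similarity-based link prediction scores (common neighbours and
# Adamic-Adar) on a tiny undirected toy graph. These illustrate the per-pair,
# easily parallelisable scores discussed above; the paper's hierarchical method
# is not reproduced here.
import math

graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"a", "c"},
    "e": {"c"},
}

def common_neighbours(u, v):
    return len(graph[u] & graph[v])

def adamic_adar(u, v):
    # weight shared neighbours inversely by the log of their degree
    return sum(1.0 / math.log(len(graph[z])) for z in graph[u] & graph[v]
               if len(graph[z]) > 1)

for pair in [("b", "d"), ("a", "e")]:
    print(pair, "CN =", common_neighbours(*pair), "AA =", round(adamic_adar(*pair), 3))
```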

Book ChapterDOI
05 Oct 2016
TL;DR: This work proposes a solution framework that generalizes the concept of privatization to support a variety of techniques, implements an inspector-executor to provide memory access analytics to the runtime for automatic tuning and shows what language extensions are needed.
Abstract: Irregular array-type reductions represent a recurring algorithmic pattern in many scientific applications. Their scalable execution on modern systems is not trivial as their irregular memory access pattern prohibits an efficient use of the memory subsystem and costly techniques are needed to eliminate data races. Taking a closer look at algorithms, memory access patterns and support techniques reveals that a one-size-fits-all solution does not exist and approaches are needed that can adapt to individual properties while maintaining programming transparency. In this work we propose a solution framework that generalizes the concept of privatization to support a variety of techniques, implements an inspector-executor to provide memory access analytics to the runtime for automatic tuning and shows what language extensions are needed. A reference implementation in OmpSs, a task-parallel programming model, shows programmability and scalability of this solution.
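The privatization and inspector-executor ideas can be sketched sequentially: the inspector records which target entries each chunk of updates touches, the executor accumulates each chunk into a private copy, and the partials are merged. This numpy sketch only illustrates the data flow; the paper's framework does this inside the OmpSs task-parallel runtime and chooses among several techniques, none of which is modelled here.

```python
# Sketch of privatization + inspector/executor for an irregular array reduction,
# written sequentially in numpy purely to illustrate the data flow. The paper's
# framework runs chunks as concurrent tasks in OmpSs and auto-tunes the technique.
import numpy as np

def irregular_reduction(indices, values, size, n_chunks=4):
    chunks = np.array_split(np.arange(len(indices)), n_chunks)

    # "inspector": record which target entries each chunk touches (access analytics)
    touched = [np.unique(indices[c]) for c in chunks]

    # "executor": each chunk accumulates into a private copy, partials are merged
    result = np.zeros(size)
    for c, t in zip(chunks, touched):
        private = np.zeros(size)                       # privatized target array
        np.add.at(private, indices[c], values[c])      # race-free local accumulation
        result[t] += private[t]                        # merge only the touched entries
    return result

rng = np.random.default_rng(1)
idx = rng.integers(0, 10, size=100)
val = rng.random(100)
print(np.allclose(irregular_reduction(idx, val, 10),
                  np.bincount(idx, weights=val, minlength=10)))
```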

Proceedings ArticleDOI
11 Sep 2016
TL;DR: An evaluation of emerging parallel applications on an asymmetric multi-core architecture, using the PARSEC benchmark suite and a processor that implements the ARM big.LITTLE architecture, concludes that these applications are not mature enough to run on such systems, as they suffer from load imbalance.
Abstract: Energy efficiency has become the main challenge for high performance computing (HPC). The use of mobile asymmetric multi-core architectures to build future multi-core systems is an approach towards energy savings while keeping high performance. However, it is not known yet whether such systems are ready to handle parallel applications. This paper fills this gap by evaluating emerging parallel applications on an asymmetric multi-core. We make use of the PARSEC benchmark suite and a processor that implements the ARM big.LITTLE architecture. We conclude that these applications are not mature enough to run on such systems, as they suffer from load imbalance. Furthermore, we explore the behaviour of dynamic scheduling solutions at either the Operating System (OS) or the runtime level. Comparing these approaches shows us that the most efficient scheduling takes place at the runtime level, steering future research towards such solutions.

Proceedings ArticleDOI
11 Sep 2016
TL;DR: This paper proposes and evaluates the basic idea behind a user-directed code transformation technique, named collective dynamic parallelism, that targets the effective exploitation of nested parallelism in modern GPUs, and shows that for sparse matrix vector multiplication, CollectiveDP outperforms well-optimized libraries, making GPUs useful when matrices are highly irregular.
Abstract: Early programs for GPU (Graphics Processing Units) acceleration were based on a flat, bulk parallel programming model, in which programs had to perform a sequence of kernel launches from the host CPU. In the latest releases of these devices, dynamic (or nested) parallelism is supported, making it possible to launch kernels from threads running on the device, without host intervention. Unfortunately, the overhead of launching kernels from the device is higher compared to launching from the host CPU, making the exploitation of dynamic parallelism unprofitable. This paper proposes and evaluates the basic idea behind a user-directed code transformation technique, named collective dynamic parallelism, that targets the effective exploitation of nested parallelism in modern GPUs. The technique dynamically packs dynamic parallelism kernel invocations and postpones their execution until a batch of them is available. We show that for sparse matrix vector multiplication, CollectiveDP outperforms well-optimized libraries, making GPUs useful when matrices are highly irregular.
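The heart of the transformation can be sketched as a buffer that packs pending nested work items and issues them as one collective launch once enough have accumulated. The batched "kernel" below is a plain Python placeholder; the actual technique operates on GPU kernels launched from the device.

```python
# Conceptual sketch of the "collective dynamic parallelism" idea: instead of
# launching one child kernel per work item, items are packed into a buffer and
# processed in a single batched launch once enough have accumulated. The batched
# "kernel" is a plain Python placeholder, not a device-side GPU launch.
class CollectiveLauncher:
    def __init__(self, batch_size, kernel):
        self.batch_size, self.kernel, self.pending = batch_size, kernel, []

    def submit(self, work_item):
        self.pending.append(work_item)
        if len(self.pending) >= self.batch_size:   # enough work: launch collectively
            self.flush()

    def flush(self):
        if self.pending:
            self.kernel(self.pending)              # one launch for the whole batch
            self.pending = []

def batched_kernel(rows):
    print(f"processing {len(rows)} rows in one launch")

launcher = CollectiveLauncher(batch_size=4, kernel=batched_kernel)
for row in range(10):                              # e.g. rows of a sparse matrix
    launcher.submit(row)
launcher.flush()                                   # drain the remaining tail
```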

Book ChapterDOI
01 Jan 2016
TL;DR: This work extracts features from CNN layers, building vector representations from CNN activations, and defines a taxonomy of knowledge as perceived by the CNN, indicating that, while top layers provide the most representative space, low layers also define descriptive dimensions.
Abstract: Convolutional Neural Networks (CNN) are the most popular of deep network models due to their applicability and success in image processing. Although plenty of effort has been made in designing and training better discriminative CNNs, little is yet known about the internal features these models learn. Questions like, what specific knowledge is coded within CNN layers, and how can it be used for other purposes besides discrimination, remain to be answered. To advance in the resolution of these questions, in this work we extract features from CNN layers, building vector representations from CNN activations. The resultant vector embedding is used to represent first images and then known image classes. On those representations we perform an unsupervised clustering process, with the goal of studying the hidden semantics captured in the embedding space. Several abstract entities untaught to the network emerge in this process, effectively defining a taxonomy of knowledge as perceived by the CNN. We evaluate and interpret these sets using WordNet, while studying the different behaviours exhibited by the layers of a CNN model according to their depth. Our results indicate that, while top (i.e., deeper) layers provide the most representative space, low layers also define descriptive dimensions.
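A compact sketch of the pipeline described: activations from one layer of a pretrained CNN are taken as vector embeddings of images and clustered without supervision. The model (ResNet-18), the tapped layer and the number of clusters are illustrative choices, not those used in the chapter, and random tensors stand in for a real image batch.

```python
# Compact sketch of the pipeline described above: activations from one layer of a
# pretrained CNN serve as vector embeddings of images and are clustered without
# supervision. Model, layer and cluster count are illustrative only; random tensors
# stand in for a real image batch.
import torch
import torchvision.models as models
from sklearn.cluster import KMeans

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

activations = []
def hook(_module, _inputs, output):               # capture the layer's output
    activations.append(torch.flatten(output, 1))

model.avgpool.register_forward_hook(hook)

images = torch.randn(16, 3, 224, 224)             # stand-in for a real image batch
with torch.no_grad():
    model(images)

embeddings = torch.cat(activations).numpy()       # one vector per image
labels = KMeans(n_clusters=4, n_init=10).fit_predict(embeddings)
print(labels)
```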