
Showing papers presented at "Irregular Applications: Architectures and Algorithms" in 2013


Proceedings ArticleDOI
17 Nov 2013
TL;DR: This paper investigates the power, energy, and performance characteristics of large-scale graph processing on hybrid (i.e., CPU and GPU) single-node systems and shows that a hybrid system is efficient in terms of both time-to-solution and energy.
Abstract: This paper investigates the power, energy, and performance characteristics of large-scale graph processing on hybrid (i.e., CPU and GPU) single-node systems. Graph processing can be accelerated on hybrid systems by properly mapping the graph layout to processing units, such that each algorithmic task runs on the unit where it performs best. However, GPUs have a much higher Thermal Design Power (TDP), so their impact on overall energy consumption is unclear. Our evaluation using large real-world graphs and synthetic graphs as large as 1 billion vertices and 16 billion edges shows that a hybrid system is efficient in terms of both time-to-solution and energy.

33 citations
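
The mapping idea lends itself to a short illustration. Below is a minimal sketch in plain Python of one common hybrid-partitioning heuristic: keep the few high-degree vertices on the CPU and offload the regular, low-degree work to the GPU. The threshold and the heuristic itself are assumptions for illustration, not the paper's exact scheme.

```python
# Hypothetical sketch: split a graph's vertices between CPU and GPU by degree,
# a common heuristic for hybrid graph processing (not the paper's exact scheme).

def partition_by_degree(adjacency, degree_threshold):
    """Assign high-degree vertices to the CPU partition (better at irregular,
    cache-unfriendly work) and low-degree vertices to the GPU partition
    (better at regular, massively parallel work)."""
    cpu_part, gpu_part = [], []
    for vertex, neighbors in adjacency.items():
        (cpu_part if len(neighbors) > degree_threshold else gpu_part).append(vertex)
    return cpu_part, gpu_part

# Toy example: a star graph -- the hub lands on the CPU, leaves on the GPU.
adjacency = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
cpu_part, gpu_part = partition_by_degree(adjacency, degree_threshold=2)
print("CPU:", cpu_part, "GPU:", gpu_part)   # CPU: [0] GPU: [1, 2, 3, 4]
```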


Proceedings ArticleDOI
17 Nov 2013
TL;DR: This paper analyzes distinct memory access models and proposes two methods that employ highly efficient bitonic merge sort with SIMD instructions as a register-level sort, achieving nearly 270x speedup on a 4M integer set using the Xeon Phi coprocessor.
Abstract: State-of-the-art hardware increasingly utilizes SIMD parallelism, where multiple processing elements execute the same instruction on multiple data points simultaneously. However, irregular and data-intensive algorithms are not well suited to such architectures, and due to their importance it is crucial to obtain efficient implementations. One example of such a task is sorting, a fundamental problem in computer science. In this paper we analyze distinct memory access models and propose two methods that employ highly efficient bitonic merge sort using SIMD instructions as a register-level sort. We achieve nearly 270x speedup (525M integers/s) on a 4M integer set using the Xeon Phi coprocessor, where SIMD-level parallelism accelerates the algorithm by more than 3 times. Our method can be applied to any device supporting similar SIMD instructions.

17 citations
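
The bitonic merge network at the heart of such register-level sorts is compact enough to sketch. A plain (scalar) Python version follows; each inner round is a set of independent compare-exchanges, which is what a SIMD version maps onto vector min/max instructions. The vectorized Xeon Phi code itself is not reproduced here.

```python
def bitonic_merge(seq):
    """Sort a bitonic sequence (ascending then descending) in place.
    Each round performs independent compare-exchanges -- exactly the
    step that maps onto SIMD min/max instructions when vectorized."""
    n = len(seq)                      # must be a power of two
    gap = n // 2
    while gap > 0:
        for i in range(0, n, 2 * gap):
            for j in range(i, i + gap):
                if seq[j] > seq[j + gap]:
                    seq[j], seq[j + gap] = seq[j + gap], seq[j]
        gap //= 2
    return seq

# Merge two sorted runs: reversing the second forms a bitonic sequence.
a, b = [1, 4, 7, 9], [2, 3, 8, 10]
print(bitonic_merge(a + b[::-1]))  # [1, 2, 3, 4, 7, 8, 9, 10]
```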


Proceedings ArticleDOI
17 Nov 2013
TL;DR: This study investigates two aspects of the TSP on multicore, NUMA, and manycore processors, and shows that applications able to fully use the resources of a manycore can have better performance and may consume 9.8 and 13 times less energy when compared to low-power and general-purpose multicore processors, respectively.
Abstract: The exponential growth in processor performance seems to have reached a turning point. Nowadays, energy efficiency is as important as performance and has become a critical aspect of the development of scalable systems. These strict energy constraints paved the way for the development of multi- and manycore processors. Research on the performance and energy efficiency of numerical kernels on multicores is common, but studies in the context of manycores are sparse. Unlike these works, in this paper we analyze a well-known irregular NP-complete problem, the Traveling Salesman Problem (TSP). This study investigates two aspects of the TSP on multicore, NUMA, and manycore processors. First, we concentrate on the nontrivial task of adapting this application to a manycore, specifically the novel MPPA-256 manycore processor. Then, we analyze its performance and energy consumption on different platforms comprising general-purpose and low-power multicores, a NUMA machine, and the MPPA-256 manycore. Our results show that applications able to fully use the resources of a manycore can have better performance and may consume 9.8 and 13 times less energy when compared to low-power and general-purpose multicore processors, respectively.

13 citations
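
The paper focuses on porting and measurement rather than a new algorithm, but the usual decomposition for parallel TSP search is easy to sketch: fix the first hop of the tour to create independent branches and explore them concurrently. A toy Python sketch (exhaustive search on 4 cities, purely illustrative of the decomposition, not the MPPA-256 port):

```python
# Hypothetical sketch of the standard TSP parallelization strategy: fixing
# the first hop creates independent subproblems searched in parallel.
from concurrent.futures import ProcessPoolExecutor
from itertools import permutations

DIST = [[0, 2, 9, 10],
        [1, 0, 6, 4],
        [15, 7, 0, 8],
        [6, 3, 12, 0]]

def best_tour_through(first_stop):
    """Exhaustively search all tours 0 -> first_stop -> ... -> 0."""
    rest = [c for c in range(1, len(DIST)) if c != first_stop]
    best = float("inf")
    for perm in permutations(rest):
        tour = [0, first_stop, *perm, 0]
        cost = sum(DIST[a][b] for a, b in zip(tour, tour[1:]))
        best = min(best, cost)
    return best

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(min(pool.map(best_tour_through, range(1, len(DIST)))))  # 21
```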


Proceedings ArticleDOI
17 Nov 2013
TL;DR: This paper aims at obtaining a quantitative understanding of the achievable GPU performance of finite volume computations in the context of the cell-centered finite volume method on 3D unstructured tetrahedral meshes by using an optimized implementation and a synthetic connectivity matrix that closely relates the achievable computing performance to the size of these diagonal blocks.
Abstract: Finite volume methods are widely used numerical strategies for solving partial differential equations. This paper aims at obtaining a quantitative understanding of the achievable GPU performance of finite volume computations in the context of the cell-centered finite volume method on 3D unstructured tetrahedral meshes. By using an optimized implementation and a synthetic connectivity matrix that exhibits a perfect structure of equal-sized blocks lying on the main diagonal, we can closely relate the achievable computing performance to the size of these diagonal blocks. Moreover, we have derived a theoretical model for identifying characteristic levels of the attainable performance as a function of the GPU's key hardware parameters. A realistic upper limit of the performance can thus be accurately predicted. For real-world tetrahedral meshes, the key to high performance lies in a reordering of the tetrahedra, such that the resulting connectivity matrix resembles a block diagonal form where the optimal size of the blocks depends on the GPU hardware. Performance can then be predicted accurately based on the success of the reordering. Numerical experiments confirm that the achieved performance is close to the practically attainable maximum and it reaches 75% of the theoretical upper limit, independent of the actual tetrahedral mesh considered.

10 citations
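
The reordering step can be illustrated with a simple BFS (Cuthill-McKee-style) traversal that gives neighbouring cells nearby indices. The paper's GPU-specific blocking criterion is more refined, so treat this Python sketch as the general idea only:

```python
from collections import deque

def bfs_reorder(adjacency):
    """Return a BFS (Cuthill-McKee-style) ordering: neighbouring cells get
    nearby new indices, pushing the connectivity matrix's nonzeros toward
    the block-diagonal structure the paper ties to GPU performance."""
    order, seen = [], set()
    for start in range(len(adjacency)):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # visit low-degree neighbours first, as Cuthill-McKee does
            for w in sorted(adjacency[v], key=lambda u: len(adjacency[u])):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order  # order[new_index] = old cell index

# Toy "mesh" given as a cell adjacency list.
adj = {0: [3], 1: [2, 4], 2: [1], 3: [0, 4], 4: [1, 3]}
print(bfs_reorder(adj))  # [0, 3, 4, 1, 2]
```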


Proceedings ArticleDOI
17 Nov 2013
TL;DR: The authors' parallel sFFT (PsFFT) implementation achieves approximately 60% parallel efficiency on a single 8-core Intel Sandy Bridge socket for relevant test cases, applying techniques such as index coalescing, data-affiliated loops, and multi-level blocking to alleviate memory access congestion and increase performance.
Abstract: The Fast Fourier Transform (FFT) is a widely used numerical algorithm. When N input data points lead to only k ≪ N nonzero coefficients in the transformed domain, sparse FFT (sFFT) algorithms can compute those coefficients substantially faster than a full FFT.

9 citations
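
The sparsity premise behind sFFT is easy to verify numerically: a signal composed of k ≪ N frequencies has only k significant FFT bins. A small numpy illustration (this demonstrates the input assumption, not the PsFFT algorithm):

```python
import numpy as np

# Illustration of spectral sparsity (not the sFFT algorithm): a length-N
# signal built from k << N frequencies has only k significant FFT bins.
N, k = 4096, 5
rng = np.random.default_rng(0)
freqs = rng.choice(N // 2, size=k, replace=False)
t = np.arange(N)
signal = sum(np.cos(2 * np.pi * f * t / N) for f in freqs)

spectrum = np.abs(np.fft.rfft(signal))
significant = np.flatnonzero(spectrum > 1e-6 * spectrum.max())
print(sorted(freqs.tolist()), "->", significant.tolist())  # the same k bins
```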


Proceedings ArticleDOI
17 Nov 2013
TL;DR: A high performance in-memory lossless data compression scheme designed to save both memory storage and bandwidth for general sparse matrices, on multicore CPUs and modern GPUs is presented.
Abstract: We present a high performance in-memory lossless data compression scheme designed to save both memory storage and bandwidth for general sparse matrices. Because the storage hierarchy is increasingly becoming the limiting factor in overall delivered machine performance, this type of data structure compression will become increasingly important. Compared to conventional compressed sparse row (CSR) using 32-bit column indices, compressed column indices (CCI) can be over 90% smaller, yet still be decompressed at tens of gigabytes per second. We present time and space savings for 20 standard sparse matrices, on multicore CPUs and modern GPUs.

9 citations
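
The paper's exact CCI format is not reproduced here, but the core trick behind most index-compression schemes — delta-encoding the sorted column indices of each row, then packing the gaps with a variable-length byte code — can be sketched as follows (an assumed encoding, for illustration):

```python
def compress_row_indices(cols):
    """Delta-encode a sorted row of column indices, then emit each gap as
    a little-endian varint (7 data bits per byte, high bit = continuation).
    Rough sketch of index compression, not the paper's exact CCI format."""
    out, prev = bytearray(), 0
    for c in cols:
        gap = c - prev
        prev = c
        while gap >= 0x80:
            out.append(0x80 | (gap & 0x7F))
            gap >>= 7
        out.append(gap)
    return bytes(out)

cols = [3, 7, 8, 150, 151, 1000]
packed = compress_row_indices(cols)
print(len(packed), "bytes vs", 4 * len(cols), "with 32-bit indices")  # 8 vs 24
```

Small gaps dominate in most sparse matrices, so most indices shrink to a single byte — which is where savings of the reported magnitude come from.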


Proceedings ArticleDOI
17 Nov 2013
TL;DR: This work investigates the nonzero patterns and memory access patterns in sparse LU factorization, and explores their common features to give guidelines for improving GPU solvers.
Abstract: The sparse matrix solver is a critical component in circuit simulators. Researchers have developed GPU-based LU factorization approaches to accelerate the sparse solver, but the performance of these solvers is constrained by the irregularities of sparse matrices. This work investigates the nonzero patterns and memory access patterns in sparse LU factorization, and explores their common features to give guidelines for improving GPU solvers. We further propose a crisscross blocked implementation on GPUs. The proposed method attains average speedups of 1.68× over the unblocked method and 2.2× over 4-threaded PARDISO for circuit matrices.

3 citations
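
As background for what blocking buys, here is a dense right-looking blocked LU in numpy; the trailing update becomes one matrix-matrix product, the kind of regular operation GPUs execute well. The paper's crisscross method applies blocking to sparse circuit matrices, which this sketch does not attempt.

```python
import numpy as np

def blocked_lu(A, bs):
    """In-place right-looking LU without pivoting, in bs x bs blocks.
    Dense sketch of the blocking idea only."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, bs):
        e = min(k + bs, n)
        # 1) Unblocked LU of the panel A[k:n, k:e].
        for j in range(k, e):
            A[j+1:n, j] /= A[j, j]
            A[j+1:n, j+1:e] -= np.outer(A[j+1:n, j], A[j, j+1:e])
        # 2) U block row: forward-substitute against the unit-lower panel.
        for j in range(k, e):
            A[j, e:n] -= A[j, k:j] @ A[k:j, e:n]
        # 3) Trailing update as one block operation -- the GEMM-like step
        #    that makes blocking fast on throughput hardware.
        A[e:n, e:n] -= A[e:n, k:e] @ A[k:e, e:n]
    return A  # L (unit diagonal, strictly below) and U packed together

rng = np.random.default_rng(1)
M = rng.random((6, 6)) + 6 * np.eye(6)   # diagonally dominant: safe w/o pivoting
F = blocked_lu(M, bs=2)
L, U = np.tril(F, -1) + np.eye(6), np.triu(F)
print(np.allclose(L @ U, M))  # True
```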


Proceedings ArticleDOI
17 Nov 2013
TL;DR: This work outlines a new algorithm for achieving a highly parallel assembly routine compatible with Intel® Xeon Phi and GPU architectures, and presents a performance comparison and analysis of this algorithm and the globalNZ algorithm outlined by Cecka et al.
Abstract: The finite element method (FEM) is a popular approach to solving differential equations [5]. Among its many attractive features is its ability to handle complex geometries: the domain is discretised using simple elements whose local contributions are assembled into a global system of equations. This is in contrast to the finite difference method (FDM), which can typically only handle regular geometries. However, before a solution is possible the FEM system of equations has to be assembled, a procedure that can be significant to the computational performance of the FEM solver, particularly when coupled with highly parallel execution [3]. In this work we outline a new algorithm for achieving a highly parallel assembly routine compatible with Intel® Xeon Phi and GPU architectures. We also present a performance comparison and analysis of our algorithm and the globalNZ algorithm outlined by Cecka et al. [2], as implemented on the Intel® Xeon Phi architecture, and compare these to the serial implementation of Hughes [5].

2 citations
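
What "assembly" means is easiest to see in a minimal serial example: each element's local stiffness matrix is scattered into the global system. In the 1D Poisson sketch below, two elements sharing a node add into the same global entry — precisely the write conflict a parallel assembly routine must resolve.

```python
import numpy as np

# Minimal 1D Poisson FEM assembly sketch (serial): each element contributes
# a 2x2 local stiffness matrix scattered into the global matrix. Elements
# sharing a node accumulate into the same global entry, which is the race
# a parallel assembler must handle.
n_elements = 4
h = 1.0 / n_elements
K = np.zeros((n_elements + 1, n_elements + 1))
k_local = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])

for e in range(n_elements):
    dofs = [e, e + 1]                 # global node numbers of this element
    for a in range(2):
        for b in range(2):
            K[dofs[a], dofs[b]] += k_local[a, b]

print(K)   # tridiagonal global stiffness matrix
```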


Proceedings ArticleDOI
Guojing Cong, Huifang Wen
17 Nov 2013
TL;DR: This work proposes a locality-centric optimization for simultaneously reducing remote accesses and improving cache performance, and improves geographical locality by matching the access pattern to the data layout.
Abstract: In modern shared-memory systems, the communication latency and available resources for a group of logical processors are determined by their relative position in the hierarchy of chips, cores, and hardware threads. Thus the performance of multithreaded applications varies with the mapping of software threads to logical processors. In our study we observe huge variation in application performance under different mappings; moreover, applications with irregular access patterns perform poorly under the default mapping. We maximize application performance by balancing communication overhead and available resources. Remote access overhead in irregular applications dominates execution time and cannot be reduced by mapping alone on NUMA systems when the logical processors span multiple chips. In addition to new data replication and distribution optimizations, we improve geographical locality by matching the access pattern to the data layout. We further propose a locality-centric optimization for simultaneously reducing remote accesses and improving cache performance. Our approach achieves better performance than prior NUMA-specific techniques.

2 citations
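
On Linux, the thread-to-processor mapping this paper tunes can be set explicitly from the standard library. A hypothetical sketch, where the cpu-id range standing in for "one chip" is an assumption about the machine's topology:

```python
import os

# Hypothetical sketch (Linux-only): pin this process to the first few
# logical processors, standing in for "all threads on one chip". The
# cpu-id range is an assumption about the topology -- check with lscpu.
one_chip = set(range(min(8, os.cpu_count() or 1)))
os.sched_setaffinity(0, one_chip)              # 0 = the calling process
print("running on CPUs:", sorted(os.sched_getaffinity(0)))
```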


Proceedings ArticleDOI
17 Nov 2013
TL;DR: This work has developed a methodology to use architectural simulators to assess the performance of different AMR data placement strategies on a selection of potential hardware interconnect topologies for exascale-class supercomputers.
Abstract: The ability to predict the performance of irregular, asynchronous applications on future hardware is essential to the exascale co-design process. Adaptive Mesh Refinement (AMR) applications are inherently irregular and dynamic in their computation and communication patterns, resulting in complex hardware/software interactions. We have developed a methodology to use architectural simulators to assess the performance of different AMR data placement strategies on a selection of potential hardware interconnect topologies for exascale-class supercomputers. We use our framework to study the CASTRO AMR compressible astrophysics code for the simulation of supernovae. The results show that a performance improvement of up to 18 percent may be obtained through the use of locality-aware data distributions for some network topologies on an exascale-class supercomputer.

2 citations
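
One example of a locality-aware distribution of the kind such a framework can evaluate: order patches along a Morton (Z-order) curve and cut the curve into contiguous chunks per node, so spatially nearby patches communicate locally. A toy sketch (the paper's strategies and topologies are richer):

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of (x, y) into a Z-order (Morton) key, so patches
    that are close in space get close keys."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key

def place(patches, n_nodes):
    """Locality-aware placement sketch: sort patches by Morton key and cut
    the curve into n_nodes contiguous chunks."""
    ordered = sorted(patches, key=lambda p: morton2d(*p))
    chunk = -(-len(ordered) // n_nodes)   # ceiling division
    return [ordered[i:i + chunk] for i in range(0, len(ordered), chunk)]

patches = [(x, y) for x in range(4) for y in range(4)]   # toy 4x4 patch grid
for node, group in enumerate(place(patches, n_nodes=4)):
    print("node", node, "->", group)   # each node gets one 2x2 quadrant
```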


Proceedings ArticleDOI
17 Nov 2013
TL;DR: A traffic generation model is proposed that captures the interactions and dependencies between computation and communication for a given application and remains tractable enough to be implemented on top of detailed photonic network simulators as well as general enough to analyze different components of the network under investigation.
Abstract: With vastly increasing system parallelism, energy efficient data movement has emerged as one of the key challenges in High Performance Computing (HPC). Optics offers the potential for creating system-wide interconnection networks with extremely high bandwidth and energy efficiency. To reap the significant benefits of optical data movement, the interconnect design must go beyond a simple wire replacement to include a fully networked architecture. Simulation is an essential tool in the associated architecture design space exploration. However, simulation requires appropriate models for capturing the relevant interactions between parallel application communication and network operations, as well as the photonic physical layer integrity. An important goal is to keep the simulation as simple as possible to enable a wide design space exploration. In this paper, we propose a traffic generation model that captures the interactions and dependencies between computation and communication for a given application. Our proposed model remains tractable enough to be implemented on top of detailed photonic network simulators as well as general enough to analyze different components of the network under investigation.
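
The shape of such a dependency-aware generator can be sketched as a tiny event loop in which message injections are triggered by the completion of compute phases rather than drawn from a fixed statistical arrival process. The constants and the three-rank ring below are illustrative assumptions, not the authors' model:

```python
import heapq

# Hypothetical sketch of dependency-aware traffic generation: a message is
# injected only when the sender's compute phase completes, so traffic timing
# follows application compute/communication dependencies.
COMPUTE, LATENCY, END = 5.0, 1.0, 20.0
events = [(COMPUTE, rank, "compute_done") for rank in range(3)]
heapq.heapify(events)
while events:
    t, rank, kind = heapq.heappop(events)
    print(f"t={t:5.1f}  rank {rank}  {kind}")
    if kind == "compute_done":
        # finishing a compute phase injects a message to the next rank ...
        heapq.heappush(events, (t + LATENCY, (rank + 1) % 3, "msg_arrival"))
        # ... and starts the next compute phase until the simulated horizon
        if t + COMPUTE <= END:
            heapq.heappush(events, (t + COMPUTE, rank, "compute_done"))
```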

Proceedings ArticleDOI
17 Nov 2013
TL;DR: Simulations of efficient approaches to the scheduling and grid partitioning problem for ensemble assimilation are presented and prospects for implementation on accelerator architectures are discussed.
Abstract: Numerical models are used to find approximate solutions to the coupled nonlinear partial differential equations associated with the prediction of the atmosphere. The model state can be represented by a grid of discrete values; subsets of grid points are assigned to tasks for parallel solution. Data assimilation algorithms are used to combine information from a model forecast with atmospheric observations to produce an improved state estimate. Observations are irregular in space and time, for instance following the track of a polar orbiting satellite. Ensemble assimilation algorithms use statistics from a set (ensemble) of forecasts to update the model state. All the challenges of heterogeneous grid computing and partitioning for atmospheric models are in play. In addition, the heterogeneous distribution of observations in space and time is a further source of irregular computing load while ensembles lead to increased storage and an additional communication pattern. Adjacent observations cannot be assimilated simultaneously leading to a mutual exclusion scheduling problem that interacts with the grid partitioning communication patterns and load balancing. Simulations of efficient approaches to the scheduling and grid partitioning problem for ensemble assimilation are presented. Prospects for implementation on accelerator architectures are also discussed.
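
The mutual exclusion constraint maps naturally onto graph coloring: observations whose update regions overlap receive different colors, and each color class forms one batch that can be assimilated in parallel. A greedy Python sketch under that framing (illustrative only; the paper additionally couples this with grid partitioning and load balancing):

```python
# Greedy-coloring sketch of the mutual exclusion constraint: overlapping
# observations may not be assimilated simultaneously, so each color class
# below is one parallel batch.
def color_batches(conflicts):
    """conflicts[i] = set of observations that overlap observation i."""
    color = {}
    for obs in sorted(conflicts, key=lambda o: -len(conflicts[o])):
        taken = {color[n] for n in conflicts[obs] if n in color}
        color[obs] = next(c for c in range(len(conflicts)) if c not in taken)
    batches = {}
    for obs, c in color.items():
        batches.setdefault(c, []).append(obs)
    return list(batches.values())

# Toy conflict graph: obs 0 overlaps 1 and 2; obs 3 is independent.
conflicts = {0: {1, 2}, 1: {0}, 2: {0}, 3: set()}
print(color_batches(conflicts))   # [[0, 3], [1, 2]]
```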