Journal ArticleDOI

Performance and energy effects on task-based parallelized applications: User-directed versus manual vectorization

TL;DR: Results show that user-directed codes achieve manually optimized code performance and energy efficiency with minimal code modifications, favoring portability across different SIMD architectures.
Abstract: Heterogeneity, parallelization and vectorization are key techniques to improve the performance and energy efficiency of modern computing systems. However, programming and maintaining code for these architectures poses a huge challenge due to the ever-increasing architecture complexity. Task-based environments hide most of this complexity, improving scalability and usage of the available resources. In these environments, while there has been a lot of effort to ease parallelization and improve the usage of heterogeneous resources, vectorization has been considered a secondary objective. Furthermore, there has been a swift and unstoppable burst of vector architectures at all market segments, from embedded to HPC. Vectorization can no longer be ignored, but manual vectorization is tedious, error-prone and not practical for the average programmer. This work evaluates the feasibility of user-directed vectorization in task-based applications. Our evaluation is based on the OmpSs programming model, extended to support user-directed vectorization for different SIMD architectures (i.e., SSE, AVX2, AVX512). Results show that user-directed codes achieve manually optimized code performance and energy efficiency with minimal code modifications, favoring portability across different SIMD architectures.

Summary (2 min read)

1 Introduction

  • While transistor shrinking allows additional features and structures to be included on the die, the increasing power density prevents the simultaneous usage of all available resources.
  • To facilitate the use of SIMD features, some programming models and languages have been extended to allow programmers to guide the compiler in the vectorization process.
  • OmpSs offers advanced features like socket-aware scheduling for NUMA architectures or pragma annotations to handle multiple dependence scenarios. Both models (OpenMP and OmpSs) have virtually the same syntax, so porting OpenMP code to OmpSs and vice versa is straightforward.
  • Section 3 shows their main experimental results and discussion.

2 Methodology

  • The authors evaluate three code versions: a) two manually vectorized implementations, one parallelized with the pthreads programming model [8] and one parallelized with the OmpSs programming model [7] (labeled pthreads and OmpSs, respectively), and b) a user-directed vectorization which is also based on OmpSs (labeled U.D.).
  • The authors use PAPI [12] to measure energy, L1D/L2/L3 cache miss-rate and total instruction count.
  • Nevertheless, this depends on the application’s algorithm.
  • This directive is used to instruct the compiler to vectorize the code, relaxing some restrictions that otherwise would prevent its vectorization.
  • Mercurium’s vectorization functionalities are still a work in progress, and as such, some of the situations require some code preparation to be done by the programmer.


  • This feature is supported by new architectures, such as AVX2 (predicated loads/stores), AVX-512 and SVE, but not by SSE or NEON.
  • As such, Mercurium will not be able to directly produce such code for the older architectures (SSE, NEON).
  • Mercurium is able to vectorize ternary operators using blend operations.
  • Therefore, “if-then-else” statements need to be transformed into simple ternary operators to be vectorized for old architectures.
  • Third, similar to manual vectorization, it is recommended to transform data structures from array-of-structures (AoS) into structure-of-arrays (SoA). Although there is some ongoing work to automate this process [17,18], the authors reused the transformations that were already applied by the ParVec authors; illustrative sketches of the ternary rewrite and the AoS-to-SoA layout are shown below.
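As a hedged illustration of the ternary-operator requirement (the functions and variables below are made up for this example, not taken from the benchmarks), an "if-then-else" that conditionally updates an element can be rewritten as a ternary assignment, which maps naturally to a blend/select operation:

/* Hypothetical example: branchy form, hard to vectorize for SSE/NEON. */
void clamp_branch(float *v, int n, float limit) {
  for (int i = 0; i < n; i++) {
    if (v[i] > limit)
      v[i] = limit;
  }
}

/* Equivalent ternary form: every lane evaluates both alternatives and a
   blend picks one, so no per-lane control flow is required. */
void clamp_ternary(float *v, int n, float limit) {
  for (int i = 0; i < n; i++)
    v[i] = (v[i] > limit) ? limit : v[i];
}

Likewise, a minimal AoS-to-SoA sketch (again with made-up types rather than the actual benchmark data structures): keeping each field in its own contiguous array lets one vector load fetch several consecutive values of the same field.

#define N 1024

/* AoS: the fields of one element are adjacent; accesses to the same field
   across elements are strided. */
typedef struct { float x, y, z; } PointAoS;

/* SoA: each field is stored contiguously, enabling unit-stride vector loads. */
typedef struct { float x[N], y[N], z[N]; } PointsSoA;

void scale_x(PointsSoA *p, float f) {
  for (int i = 0; i < N; i++)
    p->x[i] *= f;   /* contiguous accesses over p->x vectorize cleanly */
}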

3 Evaluation

  • This section shows performance and energy measurements for a subset of the ParVec benchmarks [8]: blackscholes, canneal, streamcluster and swaptions.
  • All this manually vectorized code can be avoided by using the SIMD directives.
  • This is 4x more than when only relying on threading.
  • This clause is used to inform the compiler that loc_cluster_x and loc_cluster_y are aligned to the vector-length boundary of the architecture (a sketch of this kind of annotation is shown after this list).
  • Performance increases by a factor of 1.8x and 2.1x when using SSE and AVX instructions, respectively, with regard to the scalar version.
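A hedged sketch of how such an annotation might look (OpenMP-style clause syntax as accepted by Mercurium; the function body and the 64-byte alignment value are assumptions for illustration, not the actual streamcluster code):

/* Sketch only: the aligned clause promises that loc_cluster_x and
   loc_cluster_y start on a 64-byte boundary (assumed value), so the
   compiler may emit aligned vector loads and stores. */
void accumulate(float *loc_cluster_x, const float *loc_cluster_y,
                float weight, int n) {
  #pragma omp simd aligned(loc_cluster_x, loc_cluster_y : 64)
  for (int i = 0; i < n; i++)
    loc_cluster_x[i] += weight * loc_cluster_y[i];
}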


  • Improves the vectorization of certain loops with multi-dimensional arrays.
  • As for the results at the L2 cache level, the authors see similar results for the OmpSs and U.D. versions, and different results for pthreads.
  • These results confirm the need to apply manual prefetching for this benchmark.
  • Since energy is power integrated over execution time, with the instruction count reduction previously mentioned and the power consumption remaining constant, the authors obtain an energy reduction proportional to the speedup.


The Journal of Supercomputing manuscript No.
(will be inserted by the editor)
User-directed vs. Manual Vectorization:
Performance and Energy Effects on
Task-based Parallelized Applications
Helena Caminal · Diego Caballero ·
Juan M. Cebrián · Roger Ferrer ·
Marc Casas · Miquel Moretó · Xavier
Martorell · Mateo Valero ·
Received: date / Accepted: date
Abstract Heterogeneity, parallelization and vectorization are key techniques
to improve the performance and energy efficiency of modern computing sys-
tems. However, programming and maintaining code for these architectures
poses a huge challenge due to the ever-increasing architecture complexity.
Furthermore, there has been a swift and unstoppable burst of vector archi-
tectures at all market segments, from embedded to HPC. Vectorization can no
longer be ignored, but manual vectorization is tedious, error-prone, and not
practical for programmers. This work evaluates the feasibility of user-directed
vectorization in task-based applications. Our evaluation is based on the OmpSs
programming model, extended to support user-directed vectorization for dif-
ferent Intel SIMD architectures (SSE, AVX2, IMCI and AVX-512). Results
show that user-directed codes achieve manually-optimized code performance
and energy efficiency with minimal code modifications, favoring portability
across different SIMD architectures.
Keywords SIMD · OmpSs · Performance · Vectorization · Energy Efficiency
Helena Caminal, Juan M. Cebrián, Marc Casas, Miquel Moretó, Xavier Martorell, Mateo Valero
E-mail: first.last@bsc.es
Diego Caballero
E-mail: diego@ac.upc.edu
Roger Ferrer
E-mail: rofirrim@gmail.com
This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing.
The final authenticated version is available online at: https://doi.org/10.1007/s11227-018-2294-9

1 Introduction
While transistor shrinking allows additional features and structures to be included
on the die, the increasing power density prevents the simultaneous usage of all
available resources. The importance of instruction level parallelism (ILP) subsides,
while data level parallelism (DLP) becomes a critical factor to improve the
energy efficiency of microprocessors. Among other features, SIMD instructions
have been gradually included in microprocessors for various market segments,
from mobile (ARM NEON technology [1]) to high performance computing
(Intel AVX-512 [2], ARM’s Scalable Vector Extension [3] or PowerPC Altivec
technology [4]). Each new generation includes more sophisticated, powerful
and flexible instructions. This high investment in SIMD resources per core,
especially in terms of area and power, makes extracting the full computational
power of these vector units more important than ever.
From the programmer's point of view, SIMD units can be exploited in
several ways, including: a) compiler auto-vectorization, b) low-level intrinsics
or assembly code and c) programming models/languages with explicit SIMD
support. Auto-vectorization in compilers has strong limitations in the analysis
and code transformations phases that prevent an efficient extraction of SIMD
parallelism in real applications [5]. Low-level hardware-specific intrinsics en-
able developers to fine tune their applications by providing direct access to all
of the SIMD features of the hardware. However, the use of intrinsics is time-
consuming, tedious and error-prone even for advanced programmers. Manual
vectorization forces programmers to be knowledgeable about the offered SIMD
instructions, and that becomes even more complicated with CISC ISAs. To fa-
cilitate the use of SIMD features, some programming models and languages
have been extended to allow programmers to guide the compiler in the vec-
torization process. For example, OpenMP 4.5 [6] offers a set of directives to
describe vectorizable regions. This approach is high-level, orthogonal to the
actual code and portable across different SIMD architectures.
The OpenMP 4.5 standard supports tasking and data dependencies. Par-
allelism is described by a directed acyclic graph where each node is a task and
the edges between nodes represent dependencies, which are explicitly anno-
tated by the programmer. Such annotations also provide the opportunity for
the runtime system to automatically offload tasks to accelerators like GPUs or
Intel Xeon Phi co-processors. The runtime system is empowered to take care
of data movements without specific programming intervention beyond
annotating each task's input and output dependencies. Also, the runtime
system may deploy some optimizations like data prefetching or overlapping
of computation and communication. It also enables the possibility to exploit
data locality in distributed-cache architectures, by allocating computational
resources near the cache partition where data resides.
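As a minimal sketch of this tasking style (OmpSs-flavored in/out clauses on outlined task functions; OpenMP 4.5 expresses the same dependencies with depend(in:) / depend(out:); the function names are illustrative only):

/* Task 1 writes buf, task 2 reads it. The runtime derives the edge
   produce -> consume from the annotations and schedules accordingly,
   moving data to an accelerator if the task is offloaded. */
#pragma omp task out(buf[0;n])
void produce(float *buf, int n);

#pragma omp task in(buf[0;n])
void consume(const float *buf, int n);

void pipeline(float *buf, int n) {
  produce(buf, n);      /* becomes a task instance                    */
  consume(buf, n);      /* scheduled only after produce has finished  */
  #pragma omp taskwait  /* wait for all outstanding tasks             */
}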
In this article, we evaluate the efficiency of user-directed vectorization using
OmpSs [7], developed at the Barcelona Supercomputing Center. OmpSs is
a data-flow programming model, similar to OpenMP, that eases application
porting to heterogeneous architectures. Nanos++ [7] is used as the runtime
system for the OmpSs programming model. OmpSs offers advanced features
like socket-aware scheduling for NUMA architectures or pragma annotations
to handle multiple dependence scenarios. Both models (OpenMP and OmpSs)
have virtually the same syntax, thus porting OpenMP code to OmpSs and
vice versa is straightforward.

Our main contributions include:
- Development of a task-based version of a subset of benchmarks from the
ParVec benchmark suite [8]. As discussed by the ParVec authors, benchmarks
can be classified into scalable (S), resource limited (RL) and code/input lim-
ited (CI). We chose representative benchmarks that cover this classification:
Blackscholes (S), Canneal, Streamcluster (RL) and Swaptions (CI).
- We present the code modifications necessary to generate a user-directed
code version that achieves performance and energy results similar to those
obtained with manual vectorization.
- We discuss our findings and proposed improvements for both the manu-
ally vectorized versions and the user-directed vectorization module in the
Mercurium [9] source-to-source compiler.
This article is organized as follows. Section 2 introduces our evaluation
methodology. Section 3 shows our main experimental results and discussion.
Section 4 presents a brief summary of the related work on SIMD benchmarking
and programming models. Finally, Section 5 shows our concluding remarks and
future work.
2 Methodology
In this paper we evaluate three versions of codes, including: a) two manually-
vectorized implementations, one parallelized with the pthreads programming
model [8] and one parallelized with the OmpSs programming model [7] (labeled
pthreads and OmpSs, respectively), and b) a user-directed vectorization
which is also based on OmpSs (labeled U.D.). We initially tested automatic
vectorization on the original scalar code, but it resulted in no performance or
energy improvements. Both the user-directed and OmpSs versions were developed
for this paper. In all three versions we have targeted the same loops and
functions for vectorization, making the performance and energy consumption of
the three versions directly comparable.
The pthreads codes have been compiled with ICC 14.0 and both the OmpSs
and the U.D. codes are compiled with the Mercurium compiler [9]. Mercurium
is a research source-to-source compiler with support for C, C++, and FOR-
TRAN programming languages, and OpenMP [6], OmpSs [7] and StarSs [10]
programming models, among others. We have extended the Mercurium source-
to-source infrastructure to support the directives used in the U.D. codes [11].
Mercurium’s vectorizer recognizes user annotations on the code to produce
a SIMD version of the scalar code. Binaries are then built and linked using
the Intel Compiler C/C++ as a back-end. We use the -no-vec flag to isolate our
results from the automatic vectorization performed by the Intel compiler. Fur-
ther details on the building infrastructure can be found in Table 1. For each
benchmark, we only take measurements from the Region of Interest (ROI) to
ignore the initialization and finalization parts of the applications.
The evaluation platform is a dual-socket E5-2603v3 processor running at
1.60GHz, with a total of 12 cores, 30MB of L3 cache and 64GB of DDR3.

                          pthreads                       OmpSs, U.D.
Front-end compiler        Intel Compiler C/C++ 14.0.1    Mercurium
Back-end compiler         Intel Compiler C/C++ 14.0.1 (both versions)
Flags, C/C++ codes        -O3, -no-vec, -funroll-loops, -qopt-prefetch
      [C++ codes]         [-fpermissive, -fno-exceptions]
Mathematical libraries    Short Vector Math Library (SVML)
Table 1 Building infrastructure and configuration for pthreads, OmpSs and U.D. codes.
We use PAPI [12] to measure energy, L1D/L2/L3 cache miss-rate and total
instruction count. The E5-2603v3 only provides energy information for the
whole socket, since the power plane 0 is disabled (the one that offers energy
results for the cores). The reported energy numbers account for both sockets.
The system runs CentOS 6.5 with Nanox 0.7.12a as runtime for OmpSs.
We have tested three different input sizes for the benchmarks: native, sim-
large and simsmall. Overall, L1, L2 and L3 cache measurements are affected by
the input size, emphasizing the differences between the OmpSs/U.D. versions
versus the pthreads version. For that reason, we encourage the research com-
munity to use the largest inputs possible even when they are using simulation
tools. Results will be clearer and more significant for the different program-
ming models. In terms of vectorization, bigger input sizes usually favor the use
of longer vectors, leading to better performance and energy improvements.
Nevertheless, this depends on the application’s algorithm.
Manual Vectorization: The manually vectorized (pthreads and OmpSs) codes
make use of a wrapper library [13] that provides generic vector intrinsics. These
intrinsics are translated to architecture-specific intrinsics at compile time. Vec-
tor instructions that are not supported in the target ISA are emulated to have
the same functionality. For further details on the ParVec benchmarks and the
manual vectorization process, refer to [8]. Figure 1 (bottom) shows the manually
vectorized version of the dist function (top) included in the streamcluster
benchmark. Note that the transformation is based on a direct translation from
the scalar operations (-, *, +) to their equivalent vector intrinsics (_MM_SUB,
_MM_MUL, _MM_ADD). The library will then translate those calls to ISA-specific
intrinsics (e.g., _MM_ADD to _mm_add_ps for floats in SSE, or _mm256_add_ps
for floats using AVX). This improves portability across different architectures
and abstracts the low-level details from the programmer. In addition, we do a
vector load (_MM_LOADU) and we increment the iteration count by a generic
SIMD_WIDTH.
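As a hedged sketch of how such a wrapper can map generic intrinsics onto ISA-specific ones at compile time (the ISA-specific calls are standard Intel intrinsics; the exact macros used by the ParVec wrapper may differ):

#if defined(__AVX__)
  #include <immintrin.h>
  #define _MM_TYPE      __m256
  #define SIMD_WIDTH    8                     /* 8 floats per vector */
  #define _MM_ADD(a, b) _mm256_add_ps((a), (b))
#else                                         /* SSE fallback        */
  #include <xmmintrin.h>
  #define _MM_TYPE      __m128
  #define SIMD_WIDTH    4                     /* 4 floats per vector */
  #define _MM_ADD(a, b) _mm_add_ps((a), (b))
#endif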
User-Directed Vectorization: The vectorization infrastructure implemented in
Mercurium is divided into two main phases: Vectorizer and Vector Lowering.
Vectorizer is in charge of transforming the scalar input code into a generic
vector representation, within the compiler middle-end stage. Later, the vector

/* Scalar version */
float dist(Point p1, Point p2, int dim) {
  int i;
  float result = 0.0f;
  for (i = 0; i < dim; i++) {
    result += (p1.coord[i] - p2.coord[i]) * (p1.coord[i] - p2.coord[i]);
  }
  return (result);
}

/* Manually vectorized version using the generic wrapper intrinsics */
float dist(Point p1, Point p2, int dim) {
  int i;
  _MM_TYPE result, aux, diff, coord1, coord2;
  result = _MM_SETZERO();
  for (i = 0; i < dim; i = i + SIMD_WIDTH) {
    coord1 = _MM_LOADU(&(p1.coord[i]));
    coord2 = _MM_LOADU(&(p2.coord[i]));
    diff   = _MM_SUB(coord1, coord2);
    aux    = _MM_MUL(diff, diff);
    result = _MM_ADD(result, aux);
  }
  return ((float)_MM_REDUCE_ADD(result));
}
Fig. 1 Manual vectorization example over C code (top) using the wrapper library.
#pragma omp simd [clause [[,] clause] ...] new-line
  for-loop | function-decl | function-def
Fig. 2 C/C++ syntax of the standalone simd construct.
Later, the vector lowering phase generates architecture-specific SIMD intrinsics. The vectoriza-
tion algorithm is based on the traditional strip-mining/unroll-and-jam loop
vectorization approach [14,15]. This algorithm vectorizes two kinds of code
structures: loops and functions. The simplest construct to describe SIMD par-
allelism is the pragma omp simd directive, placed on top of one of these code
structures (Figure 2). This directive instructs the compiler to vectorize
the code, relaxing some restrictions that would otherwise prevent its vectoriza-
tion. For that purpose, the compiler will assume that the vectorization is safe
and profitable without running any legality and cost model analyses. OmpSs
provides optional clauses to offer further information to the compiler about
the target code, such as the aligned clause, the suitable clause and the
vectorlength clause, among others. More detail will be given on the clauses
used for each benchmark case. Please, refer to Caballero de Gea’s work [11]
for further details on Mercurium’s vectorizer. Figure 3 shows a use case of
the compiler directive on the dist function of the streamcluster benchmark.
The addition of the optional clause reduction(+:result) to the standalone
directive annotating the loop statement is enough to automatically vectorize
the code. The reduction clause follows the same style as the parallel constructs
used for threads. It generates a scalar result, stored in the result variable,
by applying the specified reduction operation (+) to the scalar values of each
lane. Redundant instructions are combined by the back-end.
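For reference, a minimal sketch of the user-directed dist described above (reconstructed from the description in the text, not necessarily identical to the paper's Figure 3; Point is the same structure used in Fig. 1):

float dist(Point p1, Point p2, int dim) {
  float result = 0.0f;
  /* Vectorize the loop and combine the per-lane partial sums into result. */
  #pragma omp simd reduction(+:result)
  for (int i = 0; i < dim; i++)
    result += (p1.coord[i] - p2.coord[i]) * (p1.coord[i] - p2.coord[i]);
  return result;
}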

Citations
Proceedings ArticleDOI
21 Sep 2020
TL;DR: This paper proposes an implementation of predefined MPI reduction operations utilizing AVX, AVX2 and AVX-512 intrinsics to provide vector-based reduction operations and to improve the time-to-solution of these predefined MPI reduction operations.
Abstract: As the scale of high-performance computing (HPC) systems continues to grow, researchers are devoted themselves to explore increasing levels of parallelism to achieve optimal performance. The modern CPU’s design, including its features of hierarchical memory and SIMD/vectorization capability, governs algorithms’ efficiency. The recent introduction of wide vector instruction set extensions (AVX and SVE) motivated vectorization to become of critical importance to increase efficiency and close the gap to peak performance. In this paper, we propose an implementation of predefined MPI reduction operations utilizing AVX, AVX2 and AVX-512 intrinsics to provide vector-based reduction operation and to improve the time-to-solution of these predefined MPI reduction operations. With these optimizations, we achieve higher efficiency for local computations, which directly benefit the overall cost of collective reductions. The evaluation of the resulting software stack under different scenarios demonstrates that the solution is at the same time generic and efficient. Experiments are conducted on an Intel Xeon Gold cluster, which shows our AVX-512 optimized reduction operations achieve 10X performance benefits than Open MPI default for MPI local reduction.

7 citations


Cites background from "Performance and energy effects on t..."

  • ...While ILP importance subsides DLP becomes a critical factor in improving the efficiency of microprocessors [9, 12, 29, 32, 38]....


Proceedings ArticleDOI
26 Sep 2018
TL;DR: Performance, power, and energy measurements for all program versions are provided for the Intel Sandy Bridge, Haswell and Skylake architectures and the results are discussed and analyzed.
Abstract: The energy efficiency of program executions is an active research field in recent years and the influence of different programming styles on the energy consumption is part of the research effort. In this article, we concentrate on SIMD programming and study the effect of vectorization on performance as well as on power and energy consumption. Especially, SIMD programs using AVX instructions are considered and the focus is on the AVX load and store instruction set. Several semantically similar but different load and store instructions are selected and are used to build different program versions of for the same algorithm. As example application, the Gaussian elimination has been chosen due to its interesting feature of using arrays of varying length in each factorization step. Five different SIMD program versions of the Gaussian elimination have been implemented, each of which uses different load and store instructions. Performance, power, and energy measurements for all program versions are provided for the Intel Sandy Bridge, Haswell and Skylake architectures and the results are discussed and analyzed.

7 citations

Journal ArticleDOI
TL;DR: In this paper, an implementation of predefined MPI reduction operations using vector intrinsics (AVX and SVE) is proposed to improve the time-to-solution of these predefined MPI reduction operations.

2 citations

References
Proceedings ArticleDOI
04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Abstract: This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

2,697 citations


"Performance and energy effects on t..." refers background in this paper

  • ...The RODINIA [23] and ALPBench [24] benchmark suites also offer limited SIMD support, but not in the same scenario that we explore....


  • ...The RODINIA [11] and ALPBench [12] benchmark suites also offer limited SIMD support, but not in the same scenario that we explore....


Book
10 Oct 2001
TL;DR: A broad introduction to data dependence, to the many transformation strategies it supports, and to its applications to important optimization problems such as parallelization, compiler memory hierarchy management, and instruction scheduling are provided.
Abstract: Modern computer architectures designed with high-performance microprocessors offer tremendous potential gains in performance over previous designs. Yet their very complexity makes it increasingly difficult to produce efficient code and to realize their full potential. This landmark text from two leaders in the field focuses on the pivotal role that compilers can play in addressing this critical issue. The basis for all the methods presented in this book is data dependence, a fundamental compiler analysis tool for optimizing programs on high-performance microprocessors and parallel architectures. It enables compiler designers to write compilers that automatically transform simple, sequential programs into forms that can exploit special features of these modern architectures. The text provides a broad introduction to data dependence, to the many transformation strategies it supports, and to its applications to important optimization problems such as parallelization, compiler memory hierarchy management, and instruction scheduling. The authors demonstrate the importance and wide applicability of dependence-based compiler optimizations and give the compiler writer the basics needed to understand and implement them. They also offer cookbook explanations for transforming applications by hand to computational scientists and engineers who are driven to obtain the best possible performance of their complex applications.

1,087 citations


"Performance and energy effects on t..." refers methods in this paper

  • ...The vectorization algorithm is based on the traditional strip-mining/unroll-and-jam loop vectorization approach [14,15]....


Journal ArticleDOI
TL;DR: OmpSs is a programming model based on OpenMP and StarSs that can also incorporate the use of OpenCL or CUDA kernels, that is more flexible than traditional approaches to exploit multiple accelerators, and due to the simplicity of the annotations, it increases programmer's productivity.
Abstract: In this paper, we present OmpSs, a programming model based on OpenMP and StarSs, that can also incorporate the use of OpenCL or CUDA kernels. We evaluate the proposal on different architectures, SMP, GPUs, and hybrid SMP/GPU environments, showing the wide usefulness of the approach. The evaluation is done with six different benchmarks, Matrix Multiply, BlackScholes, Perlin Noise, Julia Set, PBPI and FixedGrid. We compare the results obtained with the execution of the same benchmarks written in OpenCL or OpenMP, on the same architectures. The results show that OmpSs greatly outperforms both environments. With the use of OmpSs the programming environment is more flexible than traditional approaches to exploit multiple accelerators, and due to the simplicity of the annotations, it increases programmer's productivity.

625 citations


"Performance and energy effects on t..." refers methods in this paper

  • ...vectorized implementations, one based on pthreads [2] and one based on the OmpSs programming model [4] (labeled pthreads and OmpSs, respectively), and (b) a userdirected vectorization (labeled U....


Proceedings Article
01 Jan 1999
TL;DR: The purpose of the PAPI project is to specify a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors.
Abstract: The purpose of the PAPI project is to specify a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals related to the processor’s function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis including hand tuning, compiler optimization, debugging, benchmarking, monitoring and performance modeling. In addition, it is hoped that this information will prove useful in the development of new compilation technology as well as in steering architectural development towards alleviating commonly occurring bottlenecks in high performance computing.

469 citations


"Performance and energy effects on t..." refers methods in this paper

  • ...We use PAPI [6] to measure energy, L1D/L2/L3 cache miss rate and total instruction count....


  • ...We use PAPI [12] to measure energy, L1D/L2/L3 cache miss-rate and total instruction count....


Proceedings ArticleDOI
10 Oct 2011
TL;DR: An evaluation of how well compilers vectorize a synthetic benchmark consisting of 151 loops, two applications from Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II shows that, despite all the work done on vectorization in the last 40 years, 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers evaluated.
Abstract: Most of today's processors include vector units that have been designed to speedup single threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or using intrinsics in high level languages is a time consuming and error-prone task. The alternative is to automate the process of vectorization by using vectorizing compilers. This paper evaluates how well compilers vectorize a synthetic benchmark consisting of 151 loops, two application from Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II. We evaluated three compilers: GCC (version 4.7.0), ICC (version 12.0) and XLC (version 11.01). Our results show that despite all the work done in vectorization in the last 40 years 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers we evaluated.

209 citations


"Performance and energy effects on t..." refers background in this paper

  • ...that prevent an efficient extraction of SIMD parallelism in real applications [1]....


Frequently Asked Questions (2)
Q1. What are the contributions in "User-directed vs. manual vectorization: performance and energy effects on task-based parallelized applications" ?

This work evaluates the feasibility of user-directed vectorization in task-based applications. 

As future work, the authors aim to improve energy awareness for runtime systems running on multisocket machines. They also plan to extend the evaluation of user-directed vectorization to other applications while extending Mercurium with additional features.