Q2. What are the contributions in "A high performance data parallel tensor contraction framework: application to coupled electro-mechanics" ?
The paper presents aspects of the implementation of a new high performance tensor contraction framework for the numerical analysis of coupled and multi-physics problems on streaming architectures. In addition to explicit SIMD instructions and smart expression templates, the framework introduces domain specific constructs for the tensor cross product and its associated algebra, recently rediscovered by Bonet et al. [1, 2] in the context of solid mechanics. The two key ingredients of the presented expression template engine are as follows. First, the capability to mathematically transform complex chains of operations into simpler equivalent expressions, while potentially avoiding routes with higher levels of computational complexity and, second, to perform a compile time depth-first search to find the optimal contraction indices of a large tensor network in order to minimise the number of floating point operations. Every aspect of the framework is examined through relevant performance benchmarks, including the impact of data parallelism on the performance of isomorphic and nonisomorphic tensor products, the FLOP and memory I/O optimality in the evaluation of tensor networks, the compilation cost and memory footprint of the framework, and the performance of tensor cross product kernels. In this context, domain-aware expression templates are shown to provide a significant speed-up over classical low-level style programming techniques.
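As a toy illustration of the first ingredient (operation minimisation), consider how the parenthesisation of a chain of products changes the floating point cost. The dimensions and code below are illustrative, not taken from the paper:

```cpp
// Illustrative only: FLOP counts for the two parenthesisations of a
// matrix chain A(10x100) * B(100x5) * C(5x50).
#include <cstdio>

// Cost of multiplying an (m x k) matrix by a (k x n) matrix: 2*m*k*n FLOPs.
long flops(long m, long k, long n) { return 2L * m * k * n; }

int main() {
    // (A*B)*C : 10x100x5 product first, then 10x5x50
    long left  = flops(10, 100, 5) + flops(10, 5, 50);
    // A*(B*C) : 100x5x50 product first, then 10x100x50
    long right = flops(100, 5, 50) + flops(10, 100, 50);
    std::printf("(A*B)*C: %ld FLOPs, A*(B*C): %ld FLOPs\n", left, right);
    // Prints 15000 vs 150000: a tenfold difference from ordering alone.
}
```

The compile time depth-first search described above performs this kind of cost comparison over the admissible contraction orders of a tensor network before any floating point work is done.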
Q3. What are the future works mentioned in the paper "A high performance data parallel tensor contraction framework: application to coupled electro-mechanics" ?
To study the various aspects of the above optimisation levels, a singleton computation comprising one 7th order tensor A and one 8th order tensor B is considered. The goal here is to study Fastor's internal optimisation schemes with realistic compiler flags (also in order to be consistent with the other benchmarks). Further build profiling reveals that, unlike ICC and Clang, GCC stores all large variadic templates and static arrays on the stack in order to perform global optimisation for fixed indices, but does not optimise the memory I/O. A deeper insight can be gained through a comparison of the different optimisation levels presented in Table 2. Next, the compilation aspect of operation minimisation is studied.
Q4. What is the fundamental design principle of all tensor frameworks?
The fundamental design principle that all tensor frameworks rely on is the concept of expression templates in C++ [13, 34, 35], which provides a powerful means for lazy or on-demand evaluation of arbitrary chained operators as well as delaying the evaluation of certain tensor algebraic operations.
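A minimal sketch of the expression template idea, assuming a simple fixed-size vector type (illustrative only, not Fastor's actual implementation): the operators build a lightweight expression tree, and evaluation is deferred until assignment, where a single fused loop runs with no intermediate temporaries.

```cpp
#include <array>
#include <cstddef>

// Node of the expression tree: holds references to its operands and
// evaluates one element on demand.
template <typename L, typename R>
struct Sum {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::array<double, 4> data{};
    double operator[](std::size_t i) const { return data[i]; }
    // Assignment triggers the single fused evaluation loop.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < 4; ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ does no arithmetic; it only builds the tree node.
template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r) { return {l, r}; }

int main() {
    Vec a, b, c, out;
    out = a + b + c;   // evaluated lazily; no intermediate Vec is created
}
```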
Q5. What is the way to guarantee the stability of the basis functions?
For high order elements, nodal Lagrange basis functions with optimal nodal placements [60, 72] are chosen to guarantee the stability and p-convergence property of the basis functions.
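For reference, in one dimension the standard nodal Lagrange basis on nodes $\xi_0, \dots, \xi_p$ reads

$$\ell_i(\xi) = \prod_{\substack{j = 0 \\ j \neq i}}^{p} \frac{\xi - \xi_j}{\xi_i - \xi_j}, \qquad \ell_i(\xi_j) = \delta_{ij},$$

and the placement of the nodes $\xi_j$ (for instance Gauss-Lobatto rather than equispaced points) governs the conditioning of this basis at high polynomial degree, which is why optimal nodal placements matter for stability.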
Q6. Why is the optimisation level -DOPT not available?
Note that data for GCC 6.2.0 for 4-index contractions and lower is not available at optimisation level -DOPT=2, as compilation stalls and the memory footprint becomes excessive.
Q7. What is the internal level of optimisation used for these benchmarks?
This optimisation level is indeed equivalent to writing the contraction loop nest explicitly as multiple nested for loops and relying on the compiler for further optimisations.
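For instance, a single contraction such as C_ijl = A_ijk * B_kl at this level corresponds to an explicit loop nest of the following form (dimensions are illustrative), with vectorisation and unrolling left entirely to the compiler:

```cpp
constexpr int I = 4, J = 4, K = 4, L = 4;

// Explicit loop nest for C_ijl = A_ijk * B_kl; k is the contracted index.
void contract(const double (&A)[I][J][K], const double (&B)[K][L],
              double (&C)[I][J][L]) {
    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j)
            for (int l = 0; l < L; ++l) {
                double acc = 0.0;
                for (int k = 0; k < K; ++k)
                    acc += A[i][j][k] * B[k][l];
                C[i][j][l] = acc;
            }
}
```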
Q8. What can be done to reduce the compilation time of a tensor?
As described in subsection 3.6, generating the Cartesian product of the iteration space, and further the indices of the tensors, metaprogrammatically can lead to an increase in compilation time.
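A hedged sketch of what such metaprogrammatic index generation can look like (not Fastor's actual code, C++17): materialising the full iteration space as a compile-time table whose row count is the product of the tensor dimensions, which is precisely what inflates compilation time for high-order tensors.

```cpp
#include <array>
#include <cstddef>

// Builds the iteration space {0..D0-1} x {0..D1-1} x ... as a constexpr
// table of index tuples; the table has D0*D1*... rows, so the compiler's
// work grows with the product of the dimensions.
template <std::size_t... Dims>
constexpr auto cartesian_product() {
    constexpr std::size_t rank   = sizeof...(Dims);
    constexpr std::size_t dims[] = {Dims...};
    constexpr std::size_t total  = (Dims * ...);
    std::array<std::array<std::size_t, rank>, total> table{};
    for (std::size_t n = 0; n < total; ++n) {
        std::size_t rem = n;
        for (std::size_t d = rank; d-- > 0;) {  // row-major unflattening
            table[n][d] = rem % dims[d];
            rem /= dims[d];
        }
    }
    return table;
}

// 3 x 2 iteration space: {0,0},{0,1},{1,0},{1,1},{2,0},{2,1}
constexpr auto space = cartesian_product<3, 2>();
static_assert(space[4][0] == 2 && space[4][1] == 0);
```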
Q9. What is the point of departure for the tensor contraction framework?
In the next subsections, the multiple stages of designing a tensor contraction framework using modern C++ features are presented, with the point of departure being the explicit SIMD vector types.
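To make this point of departure concrete, an explicit SIMD vector type is a thin, strongly-typed wrapper over hardware intrinsics that makes the vector width part of the type. The sketch below uses SSE2 doubles and illustrative names, not Fastor's actual SIMDVector API:

```cpp
#include <immintrin.h>  // x86 SIMD intrinsics

// Two packed doubles in one 128-bit register.
struct simd_double2 {
    __m128d v;
    simd_double2(double s) : v(_mm_set1_pd(s)) {}   // broadcast a scalar
    simd_double2(__m128d x) : v(x) {}
    static simd_double2 load(const double* p) { return _mm_loadu_pd(p); }
    void store(double* p) const { _mm_storeu_pd(p, v); }
    friend simd_double2 operator+(simd_double2 a, simd_double2 b) {
        return _mm_add_pd(a.v, b.v);
    }
    friend simd_double2 operator*(simd_double2 a, simd_double2 b) {
        return _mm_mul_pd(a.v, b.v);
    }
};

// y <- alpha*x + y, two lanes per iteration (remainder handling omitted).
void axpy(double alpha, const double* x, double* y, int n) {
    simd_double2 a(alpha);
    for (int i = 0; i + 2 <= n; i += 2) {
        auto r = a * simd_double2::load(x + i) + simd_double2::load(y + i);
        r.store(y + i);
    }
}
```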