Q2. What are the contributions in "A high performance data parallel tensor contraction framework: application to coupled electro-mechanics" ?
The paper presents aspects of the implementation of a new high performance tensor contraction framework for the numerical analysis of coupled and multi-physics problems on streaming architectures. In addition to explicit SIMD instructions and smart expression templates, the framework introduces domain specific constructs for the tensor cross product and its associated algebra, recently rediscovered by Bonet et al. [1, 2] in the context of solid mechanics. The two key ingredients of the presented expression template engine are as follows. First, the capability to mathematically transform complex chains of operations into simpler equivalent expressions, while potentially avoiding routes with higher levels of computational complexity and, second, to perform a compile time depth-first search to find the optimal contraction indices of a large tensor network in order to minimise the number of floating point operations. Every aspect of the framework is examined through relevant performance benchmarks, including the impact of data parallelism on the performance of isomorphic and nonisomorphic tensor products, the FLOP and memory I/O optimality in the evaluation of tensor networks, the compilation cost and memory footprint of the framework, and the performance of tensor cross product kernels. In this context, domain-aware expression templates are shown to provide a significant speed-up over classical low-level style programming techniques.
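As a toy illustration of the first ingredient (operation minimisation), consider how the parenthesisation of a chain of products changes the floating point cost. The dimensions and code below are illustrative, not taken from the paper:

```cpp
// Illustrative only: FLOP counts for the two parenthesisations of a
// matrix chain A(10x100) * B(100x5) * C(5x50).
#include <cstdio>

// Cost of multiplying an (m x k) matrix by a (k x n) matrix: 2*m*k*n FLOPs.
long flops(long m, long k, long n) { return 2L * m * k * n; }

int main() {
    // (A*B)*C : 10x100x5 product first, then 10x5x50
    long left  = flops(10, 100, 5) + flops(10, 5, 50);
    // A*(B*C) : 100x5x50 product first, then 10x100x50
    long right = flops(100, 5, 50) + flops(10, 100, 50);
    std::printf("(A*B)*C: %ld FLOPs, A*(B*C): %ld FLOPs\n", left, right);
    // Prints 15000 vs 150000: a tenfold difference from ordering alone.
}
```

The compile time depth-first search described above performs this kind of cost comparison over the admissible contraction orders of a tensor network before any floating point work is done.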
Q3. What are the future works mentioned in the paper "A high performance data parallel tensor contraction framework: application to coupled electro-mechanics" ?
To study the various aspects of the above optimisation levels, a singleton computation comprising one 7th order tensor A and one 8th order tensor B is considered. The goal here is to study Fastor's internal optimisation schemes with realistic compiler flags (also in order to be consistent with the other benchmarks). Further build profiling reveals that, unlike ICC and Clang, GCC stores all large variadic templates and static arrays on the stack in order to perform global optimisation for fixed indices, but does not optimise the memory I/O. A deeper insight can be gained through a comparison of the different optimisation levels presented in Table 2. Next, the compilation aspect of operation minimisation is studied.
Q4. What is the fundamental design principle of all tensor frameworks?
The fundamental design principle that all tensor frameworks rely on is the concept of expression templates in C++ [13, 34, 35], which provides a powerful means for lazy or on-demand evaluation of arbitrary chained operators as well as delaying the evaluation of certain tensor algebraic operations.
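A minimal sketch of the expression template idea, assuming a simple fixed-size vector type (illustrative only, not Fastor's actual implementation): the operators build a lightweight expression tree, and evaluation is deferred until assignment, where a single fused loop runs with no intermediate temporaries.

```cpp
#include <array>
#include <cstddef>

// Node of the expression tree: holds references to its operands and
// evaluates one element on demand.
template <typename L, typename R>
struct Sum {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::array<double, 4> data{};
    double operator[](std::size_t i) const { return data[i]; }
    // Assignment triggers the single fused evaluation loop.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < 4; ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ does no arithmetic; it only builds the tree node.
template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r) { return {l, r}; }

int main() {
    Vec a, b, c, out;
    out = a + b + c;   // evaluated lazily; no intermediate Vec is created
}
```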
Q5. What is the way to guarantee the stability of the basis functions?
For high order elements, nodal Lagrange basis functions with optimal nodal placements [60, 72] are chosen to guarantee the stability and p-convergence property of the basis functions.
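For reference, in one dimension the standard nodal Lagrange basis on nodes $\xi_0, \dots, \xi_p$ reads

$$\ell_i(\xi) = \prod_{\substack{j = 0 \\ j \neq i}}^{p} \frac{\xi - \xi_j}{\xi_i - \xi_j}, \qquad \ell_i(\xi_j) = \delta_{ij},$$

and the placement of the nodes $\xi_j$ (for instance Gauss-Lobatto rather than equispaced points) governs the conditioning of this basis at high polynomial degree, which is why optimal nodal placements matter for stability.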
Q6. Why is the optimisation level -DOPT not available?
Note that data for GCC 6.2.0 for 4-index contractions and lower is not available at optimisation level -DOPT=2, as compilation stalls and the memory footprint becomes excessive.
Q7. What is the internal level of optimisation used for these benchmarks?
This optimisation level is indeed equivalent to writing the contraction loop nest explicitly as multiple nested for loops and relying on the compiler for further optimisations.
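For instance, a single contraction such as C_ijl = A_ijk * B_kl at this level corresponds to an explicit loop nest of the following form (dimensions are illustrative), with vectorisation and unrolling left entirely to the compiler:

```cpp
constexpr int I = 4, J = 4, K = 4, L = 4;

// Explicit loop nest for C_ijl = A_ijk * B_kl; k is the contracted index.
void contract(const double (&A)[I][J][K], const double (&B)[K][L],
              double (&C)[I][J][L]) {
    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j)
            for (int l = 0; l < L; ++l) {
                double acc = 0.0;
                for (int k = 0; k < K; ++k)
                    acc += A[i][j][k] * B[k][l];
                C[i][j][l] = acc;
            }
}
```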
Q8. What can be done to reduce the compilation time of a tensor?
As described in subsection 3.6, generating the Cartesian product of the iteration space, and further the indices of the tensors, metaprogrammatically can lead to an increase in compilation time.
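A hedged sketch of what such metaprogrammatic index generation can look like (not Fastor's actual code, C++17): materialising the full iteration space as a compile-time table whose row count is the product of the tensor dimensions, which is precisely what inflates compilation time for high-order tensors.

```cpp
#include <array>
#include <cstddef>

// Builds the iteration space {0..D0-1} x {0..D1-1} x ... as a constexpr
// table of index tuples; the table has D0*D1*... rows, so the compiler's
// work grows with the product of the dimensions.
template <std::size_t... Dims>
constexpr auto cartesian_product() {
    constexpr std::size_t rank   = sizeof...(Dims);
    constexpr std::size_t dims[] = {Dims...};
    constexpr std::size_t total  = (Dims * ...);
    std::array<std::array<std::size_t, rank>, total> table{};
    for (std::size_t n = 0; n < total; ++n) {
        std::size_t rem = n;
        for (std::size_t d = rank; d-- > 0;) {  // row-major unflattening
            table[n][d] = rem % dims[d];
            rem /= dims[d];
        }
    }
    return table;
}

// 3 x 2 iteration space: {0,0},{0,1},{1,0},{1,1},{2,0},{2,1}
constexpr auto space = cartesian_product<3, 2>();
static_assert(space[4][0] == 2 && space[4][1] == 0);
```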
Q9. What is the point of departure for the tensor contraction framework?
In the next subsections, the multiple stages of designing a tensor contraction framework using modern C++ features are presented, with the point of departure being the explicit SIMD vector types.
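To make this point of departure concrete, an explicit SIMD vector type is a thin, strongly-typed wrapper over hardware intrinsics that makes the vector width part of the type. The sketch below uses SSE2 doubles and illustrative names, not Fastor's actual SIMDVector API:

```cpp
#include <immintrin.h>  // x86 SIMD intrinsics

// Two packed doubles in one 128-bit register.
struct simd_double2 {
    __m128d v;
    simd_double2(double s) : v(_mm_set1_pd(s)) {}   // broadcast a scalar
    simd_double2(__m128d x) : v(x) {}
    static simd_double2 load(const double* p) { return _mm_loadu_pd(p); }
    void store(double* p) const { _mm_storeu_pd(p, v); }
    friend simd_double2 operator+(simd_double2 a, simd_double2 b) {
        return _mm_add_pd(a.v, b.v);
    }
    friend simd_double2 operator*(simd_double2 a, simd_double2 b) {
        return _mm_mul_pd(a.v, b.v);
    }
};

// y <- alpha*x + y, two lanes per iteration (remainder handling omitted).
void axpy(double alpha, const double* x, double* y, int n) {
    simd_double2 a(alpha);
    for (int i = 0; i + 2 <= n; i += 2) {
        auto r = a * simd_double2::load(x + i) + simd_double2::load(y + i);
        r.store(y + i);
    }
}
```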