
Toward a New Metric for Ranking High Performance Computing Systems

TL;DR: A new high performance conjugate gradient (HPCG) benchmark is described, composed of computations and data access patterns more commonly found in real applications, striving for a better correlation to real scientific application performance.
Abstract: The High Performance Linpack (HPL), or Top 500, benchmark [1] is the most widely recognized and discussed metric for ranking high performance computing systems. However, HPL is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications. In this paper we describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns more commonly found in applications. Using HPCG we strive for a better correlation to real scientific application performance and expect to drive computer system design and implementation in directions that will better impact performance improvement.

Summary (1 min read)

1. INTRODUCTION

  • The High Performance Linpack (HPL) benchmark is the most widely recognized and discussed metric for ranking high performance computing systems.
  • At the same time HPL rankings of computer systems are no longer so strongly correlated to real application performance, especially for the broad set of HPC applications governed by differential equations, which tend to have much stronger needs for high bandwidth and low latency, and tend to access data using irregular patterns.
  • While Type 1 patterns are commonly found in real applications, additional computations and access patterns are also very common.

3. REQUIREMENTS

  • Any new metric the authors introduce must satisfy a number of requirements.
  • The ranking of computer systems using the new metric must correlate strongly to how their real applications would rank these same systems.
  • Drive improvements to computer systems to benefit their applications.
  • The authors will perform thorough validation testing of any proposed benchmark against a suite of applications on current high-end systems using techniques similar to those identified in the Mantevo project [3].
  • The authors will furthermore specify restrictions on changes to the reference version of the code to ensure that only changes that have relevance to their application base are permitted.

4. A PRECONDITIONED CONJUGATE GRADIENT BENCHMARK

  • As the candidate for a new HPC metric, the authors consider the preconditioned conjugate gradient (PCG) method with a local symmetric Gauss-Seidel preconditioner.
  • Set up data structures for the local symmetric Gauss-Seidel preconditioner.
  • By doing this the authors can compare the numerical results for “correctness” at the end of each m-iteration phase.
  • Since many large-scale applications use C++ for its compile-time polymorphism and object-oriented features, the authors believe it is important to have HPCG be a C++ code.
  • Timing and execution rate results are reported.

5. JUSTIFICATION FOR HPCG BENCHMARK

  • This is in contrast with many of their MPI-only applications today, and presents a big challenge to applications that must certify their computational results and debug in the presence of bitwise variability.
  • At the same time, previous efforts are not appropriate to leverage, nor do expected trends in algorithms suggest a better approach at this time.
  • As such, its scope is broader than what the authors propose here, but this benchmark does not address scalable distributed memory parallelism or nested parallelism.

7. SUMMARY AND CONCLUSIONS

  • The High Performance Linpack (HPL) Benchmark is an incredibly successful metric for the high performance computing community.
  • The trends it exposes, the focused optimization efforts it inspires and the publicity it brings to our community are very important.
  • HPCG is large enough to be mathematically meaningful, yet small enough to easily understand and use.


Toward a New Metric for Ranking High
Performance Computing Systems
June 10, 2013
Michael A. Heroux
Scalable Algorithm Department
Sandia National Laboratories
P.O. Box 5800
Albuquerque, New Mexico 87185-MS 1320
Jack Dongarra
Electrical Engineering and Computer Science Department
1122 Volunteer Blvd, University of Tennessee
Knoxville, TN 37996-3450
Abstract
The High Performance Linpack (HPL), or Top 500, benchmark [1] is the most widely recognized
and discussed metric for ranking high performance computing systems. However, HPL is
increasingly unreliable as a true measure of system performance for a growing collection of
important science and engineering applications.
In this paper we describe a new high performance conjugate gradient (HPCG) benchmark.
HPCG is composed of computations and data access patterns more commonly found in
applications. Using HPCG we strive for a better correlation to real scientific application
performance and expect to drive computer system design and implementation in directions that
will better impact performance improvement.
¹ Also released as: Sandia National Lab; SAND2013-4744

ACKNOWLEDGMENTS
The authors thank the Department of Energy National Nuclear Security Agency for funding
provided for this work.

1. INTRODUCTION
The High Performance Linpack (HPL) benchmark is the most widely recognized and discussed
metric for ranking high performance computing systems. When HPL gained prominence as a
performance metric in the early 1990s there was a strong correlation between its predictions of
system rankings and the ranking that full-scale applications would realize. Computer system
vendors pursued designs that would increase HPL performance, which would in turn improve
overall application performance.
Presently HPL remains tremendously valuable as a measure of historical trends, and as a stress
test, especially for leadership class systems that are pushing the boundaries of current
technology. Furthermore, HPL provides the HPC community with a valuable outreach tool,
understandable to the outside world. Anyone with an appreciation for computing is impressed
by the tremendous increases in performance that HPC systems have attained over the past several
decades.
At the same time HPL rankings of computer systems are no longer so strongly correlated to real
application performance, especially for the broad set of HPC applications governed by
differential equations, which tend to have much stronger needs for high bandwidth and low
latency, and tend to access data using irregular patterns. In fact, we have reached a point where
designing a system for good HPL performance can actually lead to design choices that are wrong
for the real application mix, or add unnecessary components or complexity to the system.
We expect the gap between HPL predictions and real application performance to increase in the
future. In fact, the fast track to a computer system with the potential to run HPL at 1 Exaflop is a
design that may be very unattractive for our real applications. Without some intervention, future
architectures targeted toward good HPL performance will not be a good match for our
applications. As a result, we seek a new metric that will have a stronger correlation to our
application base and will therefore drive system designers in directions that will enhance
application performance for a broader set of HPC applications.
2. WHY HPL HAS LOST RELEVANCE
HPL is a simple program that factors and solves a large dense system of linear equations using
Gaussian Elimination with partial pivoting. The dominant calculations in this algorithm are
dense matrix-matrix multiplication and related kernels, which we call Type 1 patterns. With
proper organization of the computation, data access is predominantly unit stride and is mostly
hidden by concurrently performing computation on previously retrieved data. This kind of
algorithm strongly favors computers with very high floating-point computation rates and
adequate streaming memory systems.
While Type 1 patterns are commonly found in real applications, additional computations and
access patterns are also very common. In particular, many important calculations, which we call
Type 2 patterns, have much lower computation-to-data-access ratios, access memory irregularly,
and have fine-grain recursive computations.

A system that is designed to execute both Type 1 and 2 patterns efficiently will generally run a
broad mix of applications well. However, HPL only stresses Type 1 patterns and, as a metric, is
incapable of measuring Type 2 patterns. With the emergence of accelerators, which are
extremely effective with Type 1 patterns relative to CPUs, but much less so with Type 2 patterns,
HPL results show a skewed picture relative to real application performance.
For example, the Titan system at Oak Ridge National Laboratory has 18,688 nodes, each with a
16-core, 32 GB AMD Opteron processor and a 6 GB NVIDIA K20 GPU [2]. Titan was the
top-ranked system in November 2012 using HPL. However, in obtaining the HPL result on Titan,
the Opteron processors played only a supporting role in the result. All floating-point
computation and all data were resident on the GPUs. In contrast, real applications, when initially
ported to Titan, will typically run solely on the CPUs and selectively off-load computations to
the GPU for acceleration.
Of course, one of the important research efforts in HPC today is to design applications such that
more computations are Type 1 patterns, and we will see progress in the coming years. At the
same time, most applications will always have some Type 2 patterns and our benchmarks must
reflect this reality. In fact, a system’s ability to effectively address Type 2 patterns is an
important indicator of system balance.
3. REQUIREMENTS
Any new metric we introduce must satisfy a number of requirements. Two overarching
requirements are:
1. Accurately predict system rankings for target suite of applications: The ranking of
computer systems using the new metric must correlate strongly to how our real
applications would rank these same systems.
2. Drive improvements to computer systems to benefit our applications: The metric should
be designed so that, as we try to optimize metric results for a particular platform, the
changes will also lead to better performance in our real applications. Furthermore,
computation of the metric should drive system reliability in ways that help our
applications.
We will perform thorough validation testing of any proposed benchmark against a suite of
applications on current high-end systems using techniques similar to those identified in the
Mantevo project [3]. We will furthermore specify restrictions on changes to the reference
version of the code to ensure that only changes that have relevance to our application base are
permitted.
4. A PRECONDITIONED CONJUGATE GRADIENT BENCHMARK
As the candidate for a new HPC metric, we consider the preconditioned conjugate gradient
(PCG) method with a local symmetric Gauss-Seidel preconditioner (see the primer in the
Appendix for more details about PCG).

The reference code will be implemented in C++² using MPI and OpenMP. It will do the
following:
1. Problem setup: Generate a synthetic symmetric positive definite (SPD) matrix A
(perhaps using several sparsity patterns to match the broad interests of our community)
using the compressed sparse row format, and a corresponding right-hand-side vector b,
and initial guess for x.
a. Linear system size is a parameter that can be chosen via a prescribed process that
assures realistic use of the machine resources.
b. The benchmarker can use a different matrix format and the setup cost in building
the new data structure is not counted in the benchmark timing, although the cost
will be reported, normalized by the cost of a matrix-vector multiplication
operation using the original data structures.
c. Although the matrix pattern may be regular, or nearly so, and value-symmetric,
matrix storage will be unstructured and keep a copy of all matrix values. The
benchmarker is prohibited from exploiting regularity by using, for example, a
sparse diagonal format and is prohibited from exploiting value symmetry to
reduce storage requirements.
2. Preconditioner setup: Set up data structures for the local symmetric Gauss-Seidel
preconditioner. The reference version will use simple compressed sparse row
representation for the lower and upper triangular matrices, each as a separate matrix.
a. The benchmarker is free to make the same transformations on these matrix objects
as in Step 1, again without counting this cost in the benchmark timing, but again
the setup time will be reported, normalized by the cost of one symmetric Gauss-Seidel sweep using the original matrix format.
b. We may need to introduce a simple coarse grid solve as part of the preconditioner,
if the performance of a local triangular solve is not sufficiently representative of
our real codes.
3. Verification and validation setup: We will compute preconditions, post-conditions and
invariants that will aid in the detection of anomalies during the iteration phases.
a. We can compute spectral approximations that bound the error, and use other
properties of PCG and SPD matrices to verify and validate results.
b. We can compute comparison results with reference kernels to assure accurate
computation.
4. Iteration: We will perform m iterations, n times, using the same initial guess each time,
where m and n are sufficiently large to test system uptime. By doing this we can compare
the numerical results for “correctness” at the end of each m-iteration phase.
a. If the result is not bit-wise identical across successive m-iteration phases, we can
report the deviation. Acceptable deviations (as determined in the V&V setup)
will not invalidate the benchmark results. Instead they will alert the benchmarker
that bit-wise reproducibility has been lost.
² Since many large-scale applications use C++ for its compile-time polymorphism and object-oriented features, we believe it is important to have HPCG be a C++ code. Historically C++ compilers have not received sufficient attention in the early phases of new system development. HPCG will provide incentive to re-prioritize efforts.

Citations
Proceedings ArticleDOI
08 Jun 2015
TL;DR: CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi, is proposed for real-world applications such as a solver with only tens of iterations because of its low-overhead for format conversion.
Abstract: Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV algorithm that is efficient both for regular matrices and for irregular matrices. Furthermore, we show that the overhead of the format conversion from the CSR to the CSR5 can be as low as the cost of a few SpMV operations. We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite. For the 14 regular matrices in the suite, we achieve comparable or better performance over the previous work. For the 10 irregular matrices, the CSR5 obtains average performance improvement of 17.6%, 28.5%, 173.0% and 293.3% (up to 213.3%, 153.6%, 405.1% and 943.3%) over the best existing work on dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low-overhead for format conversion.

226 citations


Cites methods from "Toward a New Metric for Ranking Hig..."

  • ...he index of 31 or 63 bits is completely compatible to most numerical libraries such as Intel MKL. Moreover, reference implementation of the recent high performance conjugate gradient (HPCG) benchmark [15] also uses 32-bit signed integer for problem dimension no more than 231 and 64-bit signed integer for problem dimension larger than that. Therefore, it is safe to save 1 bit as the empty row hint and ...


Posted Content
TL;DR: In this article, the authors proposed CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi.
Abstract: Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV algorithm that is efficient both for regular matrices and for irregular matrices. Furthermore, we show that the overhead of the format conversion from the CSR to the CSR5 can be as low as the cost of a few SpMV operations. We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite. For the 14 regular matrices in the suite, we achieve comparable or better performance over the previous work. For the 10 irregular matrices, the CSR5 obtains average performance improvement of 17.6%, 28.5%, 173.0% and 293.3% (up to 213.3%, 153.6%, 405.1% and 943.3%) over the best existing work on dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low-overhead for format conversion. The source code of this work is downloadable at this https URL

148 citations

Proceedings ArticleDOI
16 Nov 2014
TL;DR: ACSR is presented, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by combining rows into groups which have a similar number of non-zero elements, and thus avoids significant preprocessing overheads.
Abstract: Sparse matrix-vector multiplication (SpMV) is a widely used computational kernel. The most commonly used format for a sparse matrix is CSR (Compressed Sparse Row), but a number of other representations have recently been developed that achieve higher SpMV performance. However, the alternative representations typically impose a significant preprocessing overhead. While a high preprocessing overhead can be amortized for applications requiring many iterative invocations of SpMV that use the same matrix, it is not always feasible -- for instance when analyzing large dynamically evolving graphs. This paper presents ACSR, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by combining rows into groups (bins) which have a similar number of non-zero elements. Further, for rows in bins that span a wide range of non zero counts, dynamic parallelism is leveraged. A significant benefit of ACSR over other proposed SpMV approaches is that it works directly with the standard CSR format, and thus avoids significant preprocessing overheads. A CUDA implementation of ACSR is shown to outperform SpMV implementations in the NVIDIA CUSP and cuSPARSE libraries on a set of sparse matrices representing power-law graphs. We also demonstrate the use of ACSR for the analysis of dynamic graphs, where the improvement over extant approaches is even higher.

144 citations


Cites background from "Toward a New Metric for Ranking Hig..."

  • ...A significant benefit of ACSR over other proposed SpMV approaches is that it works directly with the standard CSR format, and thus avoids significant preprocessing overheads....


Proceedings ArticleDOI
05 Dec 2015
TL;DR: This work proposes an efficient hardware indirect memory prefetcher (IMP) to capture this access pattern and hide latency, and proposes a partial cacheline accessing mechanism for these prefetches to reduce the network and DRAM bandwidth pressure from the lack of spatial locality.
Abstract: Machine learning, graph analytics and sparse linear algebra-based applications are dominated by irregular memory accesses resulting from following edges in a graph or non-zero elements in a sparse matrix. These accesses have little temporal or spatial locality, and thus incur long memory stalls and large bandwidth requirements. A traditional streaming or striding prefetcher cannot capture these irregular access patterns. A majority of these irregular accesses come from indirect patterns of the form A[B[j]]. We propose an efficient hardware indirect memory prefetcher (IMP) to capture this access pattern and hide latency. We also propose a partial cacheline accessing mechanism for these prefetches to reduce the network and DRAM bandwidth pressure from the lack of spatial locality. Evaluated on 7 applications, IMP shows 56% speedup on average (up to 2.3×) compared to a baseline 64 core system with streaming prefetchers. This is within 23% of an idealized system. With partial cacheline accessing, we see another 9.4% speedup on average (up to 46.6%).

123 citations


Cites background or methods from "Toward a New Metric for Ranking Hig..."

  • ...SymGS: Symmetric Gauss-Seidel smoother (SymGS) is a key operation in the multigrid sparse 184 solver from HPCG [10]....


  • ...Many important classes of algorithms, including machine learning problems (e.g., regression, classification using Support Vector Machines, and recommender systems), graph algorithms (e.g., the Graph500 benchmark and pagerank for computing ranks of webpages), as well as HPC applications (e.g., the HPCG benchmark) share similar computational and memory access patterns to that of Sparse Matrix Vector Multiplication, where the vector Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page....


  • ...Our code is from the HPCG benchmark [10] and has been optimized for multicore processors [33]....


  • ...For sparse linear algebra, we use HPCG [10], a newly introduced component of the Top500 supercomputer rankings....


  • ...Operations on sparse data structures, such as sparse matrices, are important in a variety of emerging workloads in the areas of machine learning, graph operations and statistical analysis as well as sparse solvers used in High-Performance Computing [10, 42]....


Journal ArticleDOI
01 Feb 2016
TL;DR: A new high-performance conjugate-gradient benchmark is described, composed of computations and data-access patterns commonly found in scientific applications, meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.
Abstract: We describe a new high-performance conjugate-gradient HPCG benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.

117 citations


Cites background or methods from "Toward a New Metric for Ranking Hig..."

  • ...The HPC Challenge Benchmark Suite (Dongarra and Heroux, 2013, Luszczek et al., 2006, Luszczek and Dongarra, 2010) has established itself as a performance measurement framework with a comprehensive set of computational and, more importantly, memory-access patterns that build on the popularity and relevance of HPL but add a much richer view of benchmarked hardware....


  • ...Keywords Preconditioned conjugate gradient, multigrid smoothing, additive Schwarz, HPC benchmarking, validation and verification...


  • ...The HPCG benchmark (Dongarra and Heroux, 2013) is a tool for ranking computer systems based on a simple additive Schwarz, symmetric Gauss–Seidel preconditioned conjugate-gradient solver....


References
Journal ArticleDOI
01 Sep 1991
TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercom puters that mimic the computation and data move ment characteristics of large-scale computational fluid dynamics applications.
Abstract: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers. These consist of five "parallel kernel" benchmarks and three "simulated application" benchmarks. Together they mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications. The principal distinguishing feature of these benchmarks is their "pencil and paper" specification: all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

2,246 citations

Journal Article
TL;DR: The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world computing applications.
Abstract: TITLE: The NAS Parallel Benchmarks. AUTHOR: David H. Bailey. ACRONYMS: NAS, NPB. DEFINITION: The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks. They were originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers. Although they are no longer used as widely as they once were for comparing high-end system performance, they continue to be studied and analyzed a great deal in the high-performance computing community. The acronym "NAS" originally stood for the Numerical Aeronautical Simulation Program at NASA Ames. The name of this organization was subsequently changed to the Numerical Aerospace Simulation Program, and more recently to the NASA Advanced Supercomputing Center, although the acronym remains "NAS." The developers of the original NPB suite were David H. Bailey, Eric Barszcz, John Barton, David Browning, Russell Carter, Leo Dagum, Rod Fatoohi, Samuel Fineberg, Paul Frederickson, Thomas Lasinski, Rob Schreiber, Horst Simon, V. Venkatakrishnan and Sisira Weeratunga. DISCUSSION: The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing. The principal focus was in computational aerophysics, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world scientific computing applications. The NPB suite grew out of the need for a more rational procedure to select new supercomputers for acquisition by NASA. The emergence of commercially available highly parallel computer systems in the late 1980s offered an attractive alternative to parallel vector supercomputers that had been the mainstay of high-end scientific computing. However, the introduction of highly parallel systems was accompanied by a regrettable level of hype, not only on the part of the commercial vendors but even, in some cases, by scientists using the systems. As a result, it was difficult to discern whether the new systems offered any fundamental performance advantage over vector supercomputers, and, if so, which of the parallel offerings would be most useful in real-world scientific computation.

875 citations

01 Nov 1997
TL;DR: The sites with the 500 most powerful computer systems installed are listed, and the best Linpack benchmark performance achieved is used as the performance measure in ranking the computers.
Abstract: To provide a better basis for statistics on high-performance computers, we list the sites that have the 500 most powerful computer systems installed. The best Linpack benchmark performance achieved is used as a performance measure in ranking the computers.

785 citations

Journal ArticleDOI
TL;DR: A benchmark of iterative solvers for sparse matrices is presented and results on some high performance processors are given that show that performance is largely determined by memory bandwidth.
Abstract: We present a benchmark of iterative solvers for sparse matrices. The benchmark contains several common methods and data structures, chosen to be representative of the performance of a large class of methods in current use. We give results on some high performance processors that show that performance is largely determined by memory bandwidth.

13 citations


"Toward a New Metric for Ranking Hig..." refers background in this paper

  • ...Iterative Solver Benchmark: A lesser-known but more relevant benchmark, the Iterative Solver Benchmark [4] specifies the execution of a preconditioned CG and GMRES iteration using physically meaningful sparsity patterns and several preconditioners....


  • ...Jack Dongarra, Victor Eijkhout, Henk van der Vorst, Iterative Solver Benchmark....


Frequently Asked Questions (8)
Q1. What have the authors contributed in "Toward a new metric for ranking high performance computing systems" ?

In this paper the authors describe a new high performance conjugate gradient (HPCG) benchmark. Using HPCG the authors strive for a better correlation to real scientific application performance and expect to drive computer system design and implementation in directions that will better impact performance improvement. 

Emerging asynchronous collectives and other latency-hiding techniques can be explored in the context of HPCG and aid in their adoption and optimization on future systems. 

Since many large-scale applications use C++ for its compile-time polymorphism and object-oriented features, the authors believe it is important to have HPCG be a C++ code. 

The benchmarker is prohibited from exploiting regularity by using, for example, a sparse diagonal format and is prohibited from exploiting value symmetry to reduce storage requirements. 

The authors can compute spectral approximations that bound the error, and use other properties of PCG and SPD matrices to verify and validate results. 

The High Performance Linpack (HPL) benchmark is the most widely recognized and discussed metric for ranking high performance computing systems. 

In particular, many important calculations, which the authors call Type 2 patterns, have much lower computation-to-data-access ratios, access memory irregularly, and have fine-grain recursive computations. 

The major communication (global and neighborhood collectives) and computational patterns (vector updates, dot products, sparse matrix-vector multiplications and local triangular solves) in their production differential equation codes, both implicit and explicit, are present in this benchmark.