
Toward a New Metric for Ranking High Performance Computing Systems

TL;DR: A new high performance conjugate gradient (HPCG) benchmark is described, composed of computations and data access patterns more commonly found in real applications, striving for a better correlation to real scientific application performance.
Abstract: The High Performance Linpack (HPL), or Top 500, benchmark [1] is the most widely recognized and discussed metric for ranking high performance computing systems. However, HPL is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications. In this paper we describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns more commonly found in applications. Using HPCG we strive for a better correlation to real scientific application performance and expect to drive computer system design and implementation in directions that will better impact performance improvement.

Summary (1 min read)

1. INTRODUCTION

  • The High Performance Linpack (HPL) benchmark is the most widely recognized and discussed metric for ranking high performance computing systems.
  • At the same time HPL rankings of computer systems are no longer so strongly correlated to real application performance, especially for the broad set of HPC applications governed by differential equations, which tend to have much stronger needs for high bandwidth and low latency, and tend to access data using irregular patterns.
  • While Type 1 patterns are commonly found in real applications, additional computations and access patterns are also very common.

3. REQUIREMENTS

  • Any new metric the authors introduce must satisfy a number of requirements.
  • The ranking of computer systems using the new metric must correlate strongly to how their real applications would rank these same systems.
  • Drive improvements to computer systems to benefit their applications.
  • The authors will perform thorough validation testing of any proposed benchmark against a suite of applications on current high-end systems using techniques similar to those identified in the Mantevo project [3].
  • The authors will furthermore specify restrictions on changes to the reference version of the code to ensure that only changes that have relevance to their application base are permitted.

4. A PRECONDITIONED CONJUGATE GRADIENT BENCHMARK

  • As the candidate for a new HPC metric, the authors consider the preconditioned conjugate gradient (PCG) method with a local symmetric Gauss-Seidel preconditioner.
  • Set up data structures for the local symmetric Gauss-Seidel preconditioner.
  • By doing this the authors can compare the numerical results for “correctness” at the end of each m-iteration phase.
  • Since many large-scale applications use C++ for its compile-time polymorphism and object-oriented features, the authors believe it is important to have HPCG be a C++ code.
  • Timing and execution rate results are reported.

5. JUSTIFICATION FOR HPCG BENCHMARK

  • This is in contrast with many of their MPI-only applications today, and presents a big challenge to applications that must certify their computational results and debug in the presence of bitwise variability.
  • At the same time, previous efforts are not appropriate to leverage, nor do expected trends in algorithms suggest a better approach at this time.
  • As such, its scope is broader than what the authors propose here, but this benchmark does not address scalable distributed memory parallelism or nested parallelism.

7. SUMMARY AND CONCLUSIONS

  • The High Performance Linpack (HPL) Benchmark is an incredibly successful metric for the high performance computing community.
  • The trends it exposes, the focused optimization efforts it inspires and the publicity it brings to our community are very important.
  • HPCG is large enough to be mathematically meaningful, yet small enough to easily understand and use.


Toward a New Metric for Ranking High
Performance Computing Systems
June 10, 2013
Michael A. Heroux
Scalable Algorithm Department
Sandia National Laboratories
P.O. Box 5800
Albuquerque, New Mexico 87185-MS 1320
Jack Dongarra
Electrical Engineering and Computer Science Department
1122 Volunteer Blvd, University of Tennessee
Knoxville, TN 37996-3450
Abstract
The High Performance Linpack (HPL), or Top 500, benchmark [1] is the most widely recognized
and discussed metric for ranking high performance computing systems. However, HPL is
increasingly unreliable as a true measure of system performance for a growing collection of
important science and engineering applications.
In this paper we describe a new high performance conjugate gradient (HPCG) benchmark.
HPCG is composed of computations and data access patterns more commonly found in
applications. Using HPCG we strive for a better correlation to real scientific application
performance and expect to drive computer system design and implementation in directions that
will better impact performance improvement.
¹ Also released as: Sandia National Lab; SAND2013-4744

ACKNOWLEDGMENTS
The authors thank the Department of Energy National Nuclear Security Agency for funding
provided for this work.

1. INTRODUCTION
The High Performance Linpack (HPL) benchmark is the most widely recognized and discussed
metric for ranking high performance computing systems. When HPL gained prominence as a
performance metric in the early 1990s there was a strong correlation between its predictions of
system rankings and the ranking that full-scale applications would realize. Computer system
vendors pursued designs that would increase HPL performance, which would in turn improve
overall application performance.
Presently HPL remains tremendously valuable as a measure of historical trends, and as a stress
test, especially for leadership class systems that are pushing the boundaries of current
technology. Furthermore, HPL provides the HPC community with a valuable outreach tool,
understandable to the outside world. Anyone with an appreciation for computing is impressed
by the tremendous increases in performance that HPC systems have attained over the past several
decades.
At the same time HPL rankings of computer systems are no longer so strongly correlated to real
application performance, especially for the broad set of HPC applications governed by
differential equations, which tend to have much stronger needs for high bandwidth and low
latency, and tend to access data using irregular patterns. In fact, we have reached a point where
designing a system for good HPL performance can actually lead to design choices that are wrong
for the real application mix, or add unnecessary components or complexity to the system.
We expect the gap between HPL predictions and real application performance to increase in the
future. In fact, the fast track to a computer system with the potential to run HPL at 1 Exaflop is a
design that may be very unattractive for our real applications. Without some intervention, future
architectures targeted toward good HPL performance will not be a good match for our
applications. As a result, we seek a new metric that will have a stronger correlation to our
application base and will therefore drive system designers in directions that will enhance
application performance for a broader set of HPC applications.
2. WHY HPL HAS LOST RELEVANCE
HPL is a simple program that factors and solves a large dense system of linear equations using
Gaussian Elimination with partial pivoting. The dominant calculations in this algorithm are
dense matrix-matrix multiplication and related kernels, which we call Type 1 patterns. With
proper organization of the computation, data access is predominantly unit stride and is mostly
hidden by concurrently performing computation on previously retrieved data. This kind of
algorithm strongly favors computers with very high floating-point computation rates and
adequate streaming memory systems.
While Type 1 patterns are commonly found in real applications, additional computations and
access patterns are also very common. In particular, many important calculations, which we call
Type 2 patterns, have much lower computation-to-data-access ratios, access memory irregularly,
and have fine-grain recursive computations.

A system that is designed to execute both Type 1 and 2 patterns efficiently will generally run a
broad mix of applications well. However, HPL only stresses Type 1 patterns and, as a metric, is
incapable of measuring Type 2 patterns. With the emergence of accelerators, which are
extremely effective with Type 1 patterns relative to CPUs, but much less so with Type 2 patterns,
HPL results show a skewed picture relative to real application performance.
For example, the Titan system at Oak Ridge National Laboratory has 18,688 nodes, each with a
16-core, 32 GB AMD Opteron processor and a 6 GB NVIDIA K20 GPU [2]. Titan was the
top-ranked system in November 2012 using HPL. However, in obtaining the HPL result on Titan,
the Opteron processors played only a supporting role in the result. All floating-point
computation and all data were resident on the GPUs. In contrast, real applications, when initially
ported to Titan, will typically run solely on the CPUs and selectively off-load computations to
the GPU for acceleration.
Of course, one of the important research efforts in HPC today is to design applications such that
more computations are Type 1 patterns, and we will see progress in the coming years. At the
same time, most applications will always have some Type 2 patterns and our benchmarks must
reflect this reality. In fact, a system’s ability to effectively address Type 2 patterns is an
important indicator of system balance.
3. REQUIREMENTS
Any new metric we introduce must satisfy a number of requirements. Two overarching
requirements are:
1. Accurately predict system rankings for target suite of applications: The ranking of
computer systems using the new metric must correlate strongly to how our real
applications would rank these same systems.
2. Drive improvements to computer systems to benefit our applications: The metric should
be designed so that, as we try to optimize metric results for a particular platform, the
changes will also lead to better performance in our real applications. Furthermore,
computation of the metric should drive system reliability in ways that help our
applications.
We will perform thorough validation testing of any proposed benchmark against a suite of
applications on current high-end systems using techniques similar to those identified in the
Mantevo project [3]. We will furthermore specify restrictions on changes to the reference
version of the code to ensure that only changes that have relevance to our application base are
permitted.
4. A PRECONDITIONED CONJUGATE GRADIENT BENCHMARK
As the candidate for a new HPC metric, we consider the preconditioned conjugate gradient
(PCG) method with a local symmetric Gauss-Seidel preconditioner (see the primer in the
Appendix for more details about PCG).

The reference code will be implemented in C++² using MPI and OpenMP. It will do the
following:
1. Problem setup: Generate a synthetic symmetric positive definite (SPD) matrix A
(perhaps using several sparsity patterns to match the broad interests of our community)
using the compressed sparse row format, and a corresponding right-hand-side vector b,
and initial guess for x.
a. Linear system size is a parameter that can be chosen via a prescribed process that
assures realistic use of the machine resources.
b. The benchmarker can use a different matrix format and the setup cost in building
the new data structure is not counted in the benchmark timing, although the cost
will be reported, normalized by the cost of a matrix-vector multiplication
operation using the original data structures.
c. Although the matrix pattern may be regular, or nearly so, and value-symmetric,
matrix storage will be unstructured and keep a copy of all matrix values. The
benchmarker is prohibited from exploiting regularity by using, for example, a
sparse diagonal format and is prohibited from exploiting value symmetry to
reduce storage requirements.
2. Preconditioner setup: Set up data structures for the local symmetric Gauss-Seidel
preconditioner. The reference version will use simple compressed sparse row
representation for the lower and upper triangular matrices, each as a separate matrix.
a. The benchmarker is free to make the same transformations on these matrix objects
as in Step 1, again without counting this cost in the benchmark timing, but again
the setup time will be reported, normalized by the cost of one symmetric Gauss-Seidel sweep using the original matrix format.
b. We may need to introduce a simple coarse grid solve as part of the preconditioner,
if the performance of a local triangular solve is not sufficiently representative of
our real codes.
3. Verification and validation setup: We will compute preconditions, post-conditions and
invariants that will aid in the detection of anomalies during the iteration phases.
a. We can compute spectral approximations that bound the error, and use other
properties of PCG and SPD matrices to verify and validate results.
b. We can compute comparison results with reference kernels to assure accurate
computation.
4. Iteration: We will perform m iterations, n times, using the same initial guess each time,
where m and n are sufficiently large to test system uptime. By doing this we can compare
the numerical results for “correctness” at the end of each m-iteration phase.
a. If the result is not bit-wise identical across successive m-iteration phases, we can
report the deviation. Acceptable deviations (as determined in the V&V setup)
will not invalidate the benchmark results. Instead they will alert the benchmarker
that bit-wise reproducibility has been lost.
² Since many large-scale applications use C++ for its compile-time polymorphism and object-oriented features, we believe it is important to have HPCG be a C++ code. Historically C++ compilers have not received sufficient attention in the early phases of new system development. HPCG will provide incentive to re-prioritize efforts.

Citations
Proceedings ArticleDOI
08 Jun 2015
TL;DR: CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi, is proposed for real-world applications such as a solver with only tens of iterations because of its low-overhead for format conversion.
Abstract: Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV algorithm that is efficient both for regular matrices and for irregular matrices. Furthermore, we show that the overhead of the format conversion from the CSR to the CSR5 can be as low as the cost of a few SpMV operations. We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite. For the 14 regular matrices in the suite, we achieve comparable or better performance over the previous work. For the 10 irregular matrices, the CSR5 obtains average performance improvement of 17.6%, 28.5%, 173.0% and 293.3% (up to 213.3%, 153.6%, 405.1% and 943.3%) over the best existing work on dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low-overhead for format conversion.

226 citations


Cites methods from "Toward a New Metric for Ranking Hig..."

  • ...he index of 31 or 63 bits is completely compatible to most numerical libraries such as Intel MKL. Moreover, reference implementation of the recent high performance conjugate gradient (HPCG) benchmark [15] also uses 32-bit signed integer for problem dimension no more than 231 and 64-bit signed integer for problem dimension larger than that. Therefore, it is safe to save 1 bit as the empty row hint and ...


Posted Content
TL;DR: In this article, the authors proposed CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi.
Abstract: Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV algorithm that is efficient both for regular matrices and for irregular matrices. Furthermore, we show that the overhead of the format conversion from the CSR to the CSR5 can be as low as the cost of a few SpMV operations. We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite. For the 14 regular matrices in the suite, we achieve comparable or better performance over the previous work. For the 10 irregular matrices, the CSR5 obtains average performance improvement of 17.6%, 28.5%, 173.0% and 293.3% (up to 213.3%, 153.6%, 405.1% and 943.3%) over the best existing work on dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low-overhead for format conversion. The source code of this work is downloadable at this https URL

148 citations

Proceedings ArticleDOI
16 Nov 2014
TL;DR: ACSR is presented, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by combining rows into groups which have a similar number of non-zero elements, and thus avoids significant preprocessing overheads.
Abstract: Sparse matrix-vector multiplication (SpMV) is a widely used computational kernel. The most commonly used format for a sparse matrix is CSR (Compressed Sparse Row), but a number of other representations have recently been developed that achieve higher SpMV performance. However, the alternative representations typically impose a significant preprocessing overhead. While a high preprocessing overhead can be amortized for applications requiring many iterative invocations of SpMV that use the same matrix, it is not always feasible -- for instance when analyzing large dynamically evolving graphs. This paper presents ACSR, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by combining rows into groups (bins) which have a similar number of non-zero elements. Further, for rows in bins that span a wide range of non zero counts, dynamic parallelism is leveraged. A significant benefit of ACSR over other proposed SpMV approaches is that it works directly with the standard CSR format, and thus avoids significant preprocessing overheads. A CUDA implementation of ACSR is shown to outperform SpMV implementations in the NVIDIA CUSP and cuSPARSE libraries on a set of sparse matrices representing power-law graphs. We also demonstrate the use of ACSR for the analysis of dynamic graphs, where the improvement over extant approaches is even higher.

144 citations


Cites background from "Toward a New Metric for Ranking Hig..."

  • ...A significant benefit of ACSR over other proposed SpMV approaches is that it works directly with the standard CSR format, and thus avoids significant preprocessing overheads....


Proceedings ArticleDOI
05 Dec 2015
TL;DR: This work proposes an efficient hardware indirect memory prefetcher (IMP) to capture this access pattern and hide latency, and proposes a partial cacheline accessing mechanism for these prefetches to reduce the network and DRAM bandwidth pressure from the lack of spatial locality.
Abstract: Machine learning, graph analytics and sparse linear algebra-based applications are dominated by irregular memory accesses resulting from following edges in a graph or non-zero elements in a sparse matrix. These accesses have little temporal or spatial locality, and thus incur long memory stalls and large bandwidth requirements. A traditional streaming or striding prefetcher cannot capture these irregular access patterns. A majority of these irregular accesses come from indirect patterns of the form A[B[j]]. We propose an efficient hardware indirect memory prefetcher (IMP) to capture this access pattern and hide latency. We also propose a partial cacheline accessing mechanism for these prefetches to reduce the network and DRAM bandwidth pressure from the lack of spatial locality. Evaluated on 7 applications, IMP shows 56% speedup on average (up to 2.3×) compared to a baseline 64 core system with streaming prefetchers. This is within 23% of an idealized system. With partial cacheline accessing, we see another 9.4% speedup on average (up to 46.6%).

123 citations


Cites background or methods from "Toward a New Metric for Ranking Hig..."

  • ...SymGS: Symmetric Gauss-Seidel smoother (SymGS) is a key operation in the multigrid sparse 184 solver from HPCG [10]....


  • ...Many important classes of algorithms, including machine learning problems (e.g., regression, classification using Support Vector Machines, and recommender systems), graph algorithms (e.g., the Graph500 benchmark and pagerank for computing ranks of webpages), as well as HPC applications (e.g., the HPCG benchmark) share similar computational and memory access patterns to that of Sparse Matrix Vector Multiplication, where the vector Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page....


  • ...Our code is from the HPCG benchmark [10] and has been optimized for multicore processors [33]....


  • ...For sparse linear algebra, we use HPCG [10], a newly introduced component of the Top500 supercomputer rankings....


  • ...Operations on sparse data structures, such as sparse matrices, are important in a variety of emerging workloads in the areas of machine learning, graph operations and statistical analysis as well as sparse solvers used in High-Performance Computing [10, 42]....


Journal ArticleDOI
01 Feb 2016
TL;DR: A new high-performance conjugate-gradient benchmark is described, composed of computations and data-access patterns commonly found in scientific applications, meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.
Abstract: We describe a new high-performance conjugate-gradient HPCG benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement.

117 citations


Cites background or methods from "Toward a New Metric for Ranking Hig..."

  • ...The HPC Challenge Benchmark Suite (Dongarra and Heroux, 2013, Luszczek et al., 2006, Luszczek and Dongarra, 2010) has established itself as a performance measurement framework with a comprehensive set of computational and, more importantly, memory-access patterns that build on the popularity and relevance of HPL but add a much richer view of benchmarked hardware....


  • ...Keywords Preconditioned conjugate gradient, multigrid smoothing, additive Schwarz, HPC benchmarking, validation and verification...


  • ...The HPCG benchmark (Dongarra and Heroux, 2013) is a tool for ranking computer systems based on a simple additive Schwarz, symmetric Gauss–Seidel preconditioned conjugate-gradient solver....


References
Journal ArticleDOI
01 Sep 1991
TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercom puters that mimic the computation and data move ment characteristics of large-scale computational fluid dynamics applications.
Abstract: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers. These consist of five "parallel kernel" benchmarks and three "simulated application" benchmarks. Together they mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications. The principal distinguishing feature of these benchmarks is their "pencil and paper" specification: all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

2,246 citations

Journal Article
TL;DR: The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world computing applications.
Abstract: TITLE: The NAS Parallel Benchmarks. AUTHOR: David H. Bailey. ACRONYMS: NAS, NPB. DEFINITION: The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks. They were originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers. Although they are no longer used as widely as they once were for comparing high-end system performance, they continue to be studied and analyzed a great deal in the high-performance computing community. The acronym "NAS" originally stood for the Numerical Aeronautical Simulation Program at NASA Ames. The name of this organization was subsequently changed to the Numerical Aerospace Simulation Program, and more recently to the NASA Advanced Supercomputing Center, although the acronym remains "NAS." The developers of the original NPB suite were David H. Bailey, Eric Barszcz, John Barton, David Browning, Russell Carter, Leo Dagum, Rod Fatoohi, Samuel Fineberg, Paul Frederickson, Thomas Lasinski, Rob Schreiber, Horst Simon, V. Venkatakrishnan and Sisira Weeratunga. DISCUSSION: The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing. The principal focus was in computational aerophysics, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world scientific computing applications. The NPB suite grew out of the need for a more rational procedure to select new supercomputers for acquisition by NASA. The emergence of commercially available highly parallel computer systems in the late 1980s offered an attractive alternative to parallel vector supercomputers that had been the mainstay of high-end scientific computing. However, the introduction of highly parallel systems was accompanied by a regrettable level of hype, not only on the part of the commercial vendors but even, in some cases, by scientists using the systems. As a result, it was difficult to discern whether the new systems offered any fundamental performance advantage over vector supercomputers, and, if so, which of the parallel offerings would be most useful in real-world scientific computation.

875 citations

01 Nov 1997
TL;DR: The sites with the 500 most powerful computer systems installed are listed, and the best Linpack benchmark performance achieved is used as the performance measure in ranking the computers.
Abstract: To provide a better basis for statistics on high-performance computers, we list the sites that have the 500 most powerful computer systems installed. The best Linpack benchmark performance achieved is used as a performance measure in ranking the computers.

785 citations

Journal ArticleDOI
TL;DR: A benchmark of iterative solvers for sparse matrices is presented and results on some high performance processors are given that show that performance is largely determined by memory bandwidth.
Abstract: We present a benchmark of iterative solvers for sparse matrices. The benchmark contains several common methods and data structures, chosen to be representative of the performance of a large class of methods in current use. We give results on some high performance processors that show that performance is largely determined by memory bandwidth.

13 citations


"Toward a New Metric for Ranking Hig..." refers background in this paper

  • ...Iterative Solver Benchmark: A lesser-known but more relevant benchmark, the Iterative Solver Benchmark [4] specifies the execution of a preconditioned CG and GMRES iteration using physically meaningful sparsity patterns and several preconditioners....


  • ...Jack Dongarra, Victor Eijkhout, Henk van der Vorst, Iterative Solver Benchmark....


Frequently Asked Questions (8)
Q1. What have the authors contributed in "Toward a new metric for ranking high performance computing systems" ?

In this paper the authors describe a new high performance conjugate gradient (HPCG) benchmark. Using HPCG the authors strive for a better correlation to real scientific application performance and expect to drive computer system design and implementation in directions that will better impact performance improvement. 

Emerging asynchronous collectives and other latency-hiding techniques can be explored in the context of HPCG and aid in their adoption and optimization on future systems. 

Since many large-scale applications use C++ for its compile-time polymorphism and object-oriented features, the authors believe it is important to have HPCG be a C++ code. 

The benchmarker is prohibited from exploiting regularity by using, for example, a sparse diagonal format and is prohibited from exploiting value symmetry to reduce storage requirements. 

The authors can compute spectral approximations that bound the error, and use other properties of PCG and SPD matrices to verify and validate results. 

The High Performance Linpack (HPL) benchmark is the most widely recognized and discussed metric for ranking high performance computing systems. 

In particular, many important calculations, which the authors call Type 2 patterns, have much lower computation-to-data-access ratios, access memory irregularly, and have fine-grain recursive computations. 

The major communication (global and neighborhood collectives) and computational patterns (vector updates, dot products, sparse matrix-vector multiplications and local triangular solves) in their production differential equation codes, both implicit and explicit, are present in this benchmark.