
Bounding pipeline and instruction cache performance

TLDR
This paper describes an approach for bounding the worst and best-case performance of large code segments on machines that exploit both pipelining and instruction caching, and indicates that the timing analyzer efficiently produces tight predictions of worst and best-case performance for pipelining and instruction caching.



Bounding Pipeline and Instruction Cache Performance*
Christopher A. Healy, Robert D. Arnold, Frank Mueller, David B. Whalley, and Marion G. Harmon†
Abstract
Predicting the execution time of code segments in real-time systems is challenging. Most recently designed machines contain pipelines and caches. Pipeline hazards may result in multicycle delays. Instruction or data memory references may not be found in cache and these misses typically require several cycles to resolve. Whether an instruction will stall due to a pipeline hazard or a cache miss depends on the dynamic sequence of previous instructions executed and memory references performed. Furthermore, these penalties are not independent since delays due to pipeline stalls and cache miss penalties may overlap. This paper describes an approach for bounding the worst and best-case performance of large code segments on machines that exploit both pipelining and instruction caching. First, a method is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. Next, these categorizations are used in the pipeline analysis of sequences of instructions representing paths within the program. A timing analyzer uses the pipeline path analysis to estimate the worst and best-case execution performance of each loop and function in the program. Finally, a graphical user interface is invoked that allows a user to request timing predictions on portions of the program. The results indicate that the timing analyzer efficiently produces tight predictions of worst and best-case performance for pipelining and instruction caching.
Index terms: real-time systems, worst-case execution time, best-case execution time, timing analysis, instruction cache, pipelining
1. Introduction
Many architectural features, such as pipelines and caches, present a dilemma for architects of real-time systems. Use of these architectural features can result in significant performance improvements. In order to exploit these performance improvements in a real-time system, the WCET (Worst-Case Execution Time) must be predicted statically. In addition, sometimes the BCET (Best-Case Execution Time) is also needed. However, the aforementioned performance enhancing features introduce a potentially high level of unpredictability. Dependencies between instructions can cause pipeline hazards that may delay the completion of instructions. While there has been much work accomplished
* This work was supported in part by the Office of Naval Research under contract number N00014-94-1-0006 and the National Science Foundation under the cooperative agreement HRD-9707076. Preliminary versions of this work were described in the 1994 Real-Time Systems Symposium under the title "Bounding Worst-Case Instruction Cache Performance" and the 1995 Real-Time Systems Symposium under the title "Integrating the Timing Analysis of Pipelining and Instruction Caching."
† C. A. Healy and D. B. Whalley are with the Department of Computer Science, Florida State University, Tallahassee, FL 32306-4530. R. D. Arnold is with Peek Traffic Systems Inc., 3000 Commonwealth Blvd., Tallahassee, FL 32303. F. Mueller is with the Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, D-10099 Berlin, Germany. M. G. Harmon is with the Department of Computer and Information Systems, Florida A & M University, Tallahassee, FL 32307-3101. The authors can be contacted at either [{whalley,healy}@cs.fsu.edu, (850) 644-3506, fax: -0058], [rarnold@transyt.peek-traffic.com, (850) 562-2253, ext. 272, fax: -4126], [mueller@informatik.hu-berlin.de, (+49) (30) 20181-276, fax: -280] or [harmon@cis.famu.edu, (850) 599-3042, fax: -3221].

on analyzing the execution performance of a sequence of instructions within a basic block, the analysis of pipeline performance across basic blocks is more problematic. Instruction or data cache misses further complicate the performance prediction problem since they require several more cycles to resolve than cache hits. Predicting the caching behavior of an instruction is even more difficult since it may be affected by memory references that occurred long before the instruction was executed.
The timing analysis of these features is further exacerbated since pipelining and caching behavior are not independent. For instance, consider the code segment and pipeline diagram in Figure 1, consisting of three SPARC instructions. The pipeline cycles and stages represent the execution of these instructions on a MicroSPARC I processor [1]. Each number within the pipeline diagram denotes that the specified instruction is currently in the pipeline stage shown on the left and is in that stage during the cycle indicated above. The first instruction performs a floating-point addition and requires a total of 20 cycles. Fetching the second instruction results in a cache miss, which is assumed to have a miss penalty of nine additional cycles in this paper. The third instruction has a data dependency with the first instruction and the execution of its MEM stage is delayed until the floating-point addition is completed.¹
inst 1: faddd %f2,%f0,%f2
inst 2: sub %o4,%g1,%i2
inst 3: std %f2,[%o0+8]
SPARC Instructions
[Pipeline Diagram: the three instructions proceed through the IF, ID, EX, MEM, WB, FEX, and FWB stages over cycles 1 through 22; instruction 1's lengthy floating-point execution overlaps the cache miss incurred when fetching instruction 2, and instruction 3 completes in cycle 22.]
Figure 1. Example of Overlapping Pipeline Stages with a Cache Miss

The miss penalty associated with the access to main memory to fetch the second instruction is completely overlapped with the execution of the floating-point addition in the first instruction. If pipeline stalls and cache misses were treated independently, then the number of estimated cycles associated with these instructions would be increased from 22 to 31 (i.e., by the nine-cycle cache miss penalty).
Unfortunately, the problem of overestimating WCET and underestimating BCET may become more severe in the future. Cache miss penalties are increasing due to the growing gap between processor and main memory speeds. Delays due to pipeline stalls become more likely with the introduction of superscalar and superpipelined architectures. Thus, naive timing analysis of programs on machines with pipelines and caches will result in increased execution time prediction errors.
Let us define a task as the portion of code executed between two scheduling points (context switches) in a system with a non-preemptive scheduling paradigm. When a task starts execution, the cache memory is assumed to be invalidated. During task execution, instructions are brought into cache and often result in many hits and misses that can be predicted statically. These caching predictions can be integrated with pipeline analysis to estimate tight WCET and BCET bounds.
Figure 2 depicts an overview of the approach described in this paper for bounding the worst and best-case performance of large code segments on machines with pipelines and instruction caches. Control-flow information, which could have been obtained by analyzing assembly or object files, is stored as a side effect of the compilation. This information identifies the loops that are in each function, the basic blocks that comprise each loop, the instructions that reside in each basic block, and the register operands associated with each instruction. The control-flow information is passed to a static cache simulator.
¹ A std instruction has no write back stage since a store instruction only updates memory and not a register. The std instruction also requires three cycles to complete the MEM stage on the MicroSPARC I.

[Figure 2 diagram: C Source Files feed the Compiler, which emits Control Flow Information; the Static Cache Simulator combines this information with a Cache Configuration to produce Instruction Caching Categorizations; the Timing Analyzer reads the categorizations, the control-flow information, and Machine Dependent Information, and answers User Timing Requests with Timing Predictions through the User Interface.]
Figure 2. Overview of Bounding Pipelining and Instruction Caching Performance.
The static cache simulator constructs the control-flow graph of the program, which consists of the call graph and the control flow of each function. The program's control-flow graph is then analyzed for a given cache configuration to produce a categorization of each instruction's potential caching behavior. The timing analyzer uses these categorizations to determine whether an instruction fetch should be treated as a hit or a miss during the pipeline analysis. It also reads machine-dependent and control-flow information to determine how each instruction proceeds through the pipeline. The timing analyzer produces a worst and best-case estimate of execution time for each loop and function within the program. Finally, a window-based interface is used to allow the user to request the timing bounds for portions of the program.
2. Instruction Caching Categorization
Static cache simulation² is used to statically categorize each instruction according to its caching behavior using a specific cache configuration in a given program. The static simulation consists of three phases. First, the control-flow graph of the entire program is constructed. This graph includes the control-flow information of each function and a function instance graph, which is simply a call graph where each function instance is uniquely identified by the sequence of call sites required for its invocation. Thus, a directed acyclic call graph (without recursion) is transformed into a tree of function instances.
² Static cache simulation is only briefly introduced in this section. It is described in more detail elsewhere [2, 3, 4, 5, 6].

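As a concrete illustration, a function instance node might be represented as sketched below, assuming a first-child/next-sibling tree; the field names and layout are hypothetical, not the paper's actual data structure.

    /* Hypothetical node in a function instance tree: a directed acyclic
     * call graph with no recursion unfolds into a tree, with one instance
     * per distinct sequence of call sites. */
    struct func_instance {
        const char           *func_name;  /* function this instance is of     */
        int                   call_site;  /* call site in the parent instance */
        struct func_instance *parent;     /* caller instance (NULL for main)  */
        struct func_instance *child;      /* first callee instance            */
        struct func_instance *sibling;    /* next instance called by parent   */
    };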
Next, this program control-flow graph is analyzed to determine the program lines that may be in cache at the entry and exit of each basic block within the program. The iterative algorithm in Figure 3 is used to calculate an input and output cache state for each basic block in the function instance graph. A cache state is simply the subset of all program lines that can potentially be cached at that point in the control flow. Initially, the input state of the top block (the entry block of the main function) is set to all invalid lines. The input state of a block is calculated by taking the union of the output states of its immediate predecessors. The output state of a block is calculated by taking the union of its input state and the program lines accessed by the block and subtracting the program lines with which the block conflicts. The above steps are repeated until no more changes occur.
input_state(top) = all invalid lines
WHILE any change DO
    FOR each basic block instance B DO
        input_state(B) = NULL
        FOR each immed pred P of B DO
            input_state(B) += output_state(P)
        output_state(B) = (input_state(B) + prog_lines(B)) - conf_lines(B)
Figure 3. Algorithm to Calculate Cache States.
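A minimal C sketch of this fixpoint computation follows, assuming cache states are encoded as 64-bit sets of program lines (so at most 64 lines in this toy version) and that block 0 is the entry block of main; the data layout is illustrative, not the paper's implementation.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define MAX_PREDS 8

    struct block {
        int      preds[MAX_PREDS];  /* indices of immediate predecessors       */
        int      num_preds;
        uint64_t prog_lines;        /* program lines accessed by this block    */
        uint64_t conf_lines;        /* program lines this block conflicts with */
    };

    /* Iterate to a fixed point over all basic block instances: the input
     * state is the union of the predecessors' output states, and the output
     * state is (input + lines accessed) - lines the block conflicts with.
     * "All invalid lines" for the top block is modeled as the empty set. */
    void calc_cache_states(const struct block *blocks, int n,
                           uint64_t *input, uint64_t *output)
    {
        memset(input, 0, n * sizeof *input);   /* top block starts invalid */
        memset(output, 0, n * sizeof *output);

        bool changed = true;
        while (changed) {
            changed = false;
            for (int b = 0; b < n; b++) {
                uint64_t in = 0;
                for (int p = 0; p < blocks[b].num_preds; p++)
                    in |= output[blocks[b].preds[p]];
                uint64_t out = (in | blocks[b].prog_lines)
                               & ~blocks[b].conf_lines;
                if (in != input[b] || out != output[b]) {
                    input[b] = in;
                    output[b] = out;
                    changed = true;
                }
            }
        }
    }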
The input state for each basic block is used to categorize the caching behavior of each instruction within the block. The categorization for each loop level is obtained by examining the cache state for that instruction with a mask representing the program lines that are accessed by the loop. An instruction's caching behavior is assigned to one of four categories for each loop level in which an instruction is contained. Note that each function is treated as a loop that executes for a single iteration.
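The four per-loop-level categories are not named in the text above; in this line of work they are typically always hit, always miss, first miss, and first hit (only first miss appears later on this page, so the remaining names are an assumption). A sketch:

    /* Per-loop-level caching categories, assuming the usual set from this
     * line of work; the names beyond FIRST_MISS are an assumption. */
    enum caching_category {
        ALWAYS_HIT,   /* the fetch is guaranteed to be in cache                */
        ALWAYS_MISS,  /* the fetch is never guaranteed to be in cache          */
        FIRST_MISS,   /* misses on the first loop iteration, hits afterwards   */
        FIRST_HIT     /* hits on the first loop iteration, may miss afterwards */
    };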

Citations
Book

Embedded System Design

TL;DR: Embedded System Design can be used as a textbook for courses on embedded systems and as a source that provides pointers to relevant material in the area for PhD students and teachers.
Journal ArticleDOI

The influence of processor architecture on the design and the results of WCET tools

TL;DR: The designs of WCET tools for a series of increasingly complex processors, including SuperSPARC, Motorola ColdFire 5307, and Motorola PowerPC 755, are described, and some advice is given as to the predictability of processor architectures.
Journal ArticleDOI

Guest Editorial: A Review of Worst-Case Execution-Time Analysis

TL;DR: The goal of this special issue is to review the achievements in WCET analysis and to report about the recent advances in this field.
Book

Real-Time Systems: Scheduling, Analysis, and Verification

TL;DR: Real-Time Systems: Scheduling, Analysis, and Verification provides a substantial, up-to-date overview of the verification and validation process of real-time systems.
References
Book

Compilers: Principles, Techniques, and Tools

TL;DR: This book discusses the design of a Code Generator, the role of the Lexical Analyzer, and other topics related to code generation and optimization.
Journal ArticleDOI

Calculating the maximum execution time of real-time programs

TL;DR: The problems for the calculation of the maximum execution time (MAXT... MAximum eXecution Time) are discussed and the preconditions which have to be met before the MAXT of a task can be calculated are shown.
Journal ArticleDOI

Reasoning about time in higher-level language software

TL;DR: A methodology for specifying and providing assertions about time in higher-level-language programs is described, and examples of timing bounds and assertions that are proved include deadlines, timing invariants for periodic processes, and the specification of time-based events such as those needed for the recognition of single and double clicks from a mouse button.
Journal ArticleDOI

A case for direct-mapped caches

TL;DR: Direct-mapped caches are defined, and it is shown that trends toward larger cache sizes and faster hit times favor their use.
Journal ArticleDOI

Predicting program execution times by analyzing static and dynamic program paths

TL;DR: A formal path model for dynamic path analysis is introduced, where user execution information is represented by a set of program paths and a method to verify given user information with known program verification techniques is introduced.
Frequently Asked Questions (8)
Q1. What have the authors contributed in "Bounding pipeline and instruction cache performance"?

This paper describes an approach for bounding the worst and best-case performance of large code segments on machines that exploit both pipelining and instruction caching. 

The authors plan to automate the detection of many data dependencies using existing compiler optimization techniques to obtain tighter performance estimations [23]. The authors also plan to accurately calculate the number of iterations for loops which are dependent on the value of a loop counter variable of an outer loop. Due to the analysis of a function instance tree (no recursion allowed), addresses of run-time stack references can be statically determined even when the addresses may differ for different invocations of the same function. Compiler flow analysis can be used to detect the pattern of many calculated references, such as indexing through an array.

Because of this dual caching behavior of a first miss instruction, it is necessary to perform more than one pipeline analysis of a path since the caching behavior of the instructions comprising the path can change between iterations. 

Since the pipeline effects of each of the paths within the loop are unioned, it only remains to be shown that the caching effects are treated properly. 

The list of memory blocks known to be in cache after executing the other construct is used to adjust the time of the current construct by comparing this list to the list of first reference blocks in the current construct.

If an instruction is categorized as a first miss, then the timing analyzer will treat the instruction fetch as a miss if the program line has not yet been encountered as a first miss in the timing of the loop. 
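A minimal sketch of just this first-miss rule, reusing the hypothetical caching_category enum from Section 2 above; the per-line seen flag (reset whenever the timing of a loop begins) is an assumption about one way to implement the bookkeeping, not the paper's code.

    #include <stdbool.h>

    /* Returns true if a FIRST_MISS fetch should be timed as a miss: only
     * the first time its program line is encountered while timing the
     * current loop.  seen[] holds one flag per program line and is cleared
     * before the loop's timing analysis starts. */
    bool first_miss_fetch_is_miss(int prog_line, bool seen[])
    {
        if (!seen[prog_line]) {
            seen[prog_line] = true;  /* later encounters are treated as hits */
            return true;
        }
        return false;
    }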

The timing analyzer will compute a BCET of 13 + 9*(n − 1) cycles for this loop, where n is the minimum number of loop iterations.

The Lim method would rarely detect instruction fetches that would always be misses until the surrounding constructs are analyzed, which is after the pipeline analysis of a construct has already occurred.