
Bounding pipeline and instruction cache performance

TLDR
This paper describes an approach for bounding the worst and best-case performance of large code segments on machines that exploit both pipelining and instruction caching, and indicates that the timing analyzer efficiently produces tight predictions of worst and best-case performance for pipelining and instruction caching.



Bounding Pipeline and Instruction Cache Performance*
Christopher A. Healy, Robert D. Arnold, Frank Mueller, David B. Whalley, and Marion G. Harmon†
Abstract
Predicting the execution time of code segments in real-time systems is challenging. Most recently designed machines contain pipelines and caches. Pipeline hazards may result in multicycle delays. Instruction or data memory references may not be found in cache and these misses typically require several cycles to resolve. Whether an instruction will stall due to a pipeline hazard or a cache miss depends on the dynamic sequence of previous instructions executed and memory references performed. Furthermore, these penalties are not independent since delays due to pipeline stalls and cache miss penalties may overlap. This paper describes an approach for bounding the worst and best-case performance of large code segments on machines that exploit both pipelining and instruction caching. First, a method is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. Next, these categorizations are used in the pipeline analysis of sequences of instructions representing paths within the program. A timing analyzer uses the pipeline path analysis to estimate the worst and best-case execution performance of each loop and function in the program. Finally, a graphical user interface is invoked that allows a user to request timing predictions on portions of the program. The results indicate that the timing analyzer efficiently produces tight predictions of worst and best-case performance for pipelining and instruction caching.
Index terms: real-time systems, worst-case execution time, best-case execution time, timing analysis, instruction cache, pipelining
1. Introduction
Many architectural features, such as pipelines and caches, present a dilemma for architects of real-time systems. Use of these architectural features can result in significant performance improvements. In order to exploit these performance improvements in a real-time system, the WCET (Worst-Case Execution Time) must be predicted statically. In addition, sometimes the BCET (Best-Case Execution Time) is also needed. However, the aforementioned performance enhancing features introduce a potentially high level of unpredictability. Dependencies between instructions can cause pipeline hazards that may delay the completion of instructions. While there has been much work accomplished
* This work was supported in part by the Office of Naval Research under contract number N00014-94-1-0006 and the National Science Foundation under the cooperative agreement HRD-9707076. Preliminary versions of this work were described in the 1994 Real-Time Systems Symposium under the title "Bounding Worst-Case Instruction Cache Performance" and the 1995 Real-Time Systems Symposium under the title "Integrating the Timing Analysis of Pipelining and Instruction Caching."
† C. A. Healy and D. B. Whalley are with the Department of Computer Science, Florida State University, Tallahassee, FL 32306-4530. R. D. Arnold is with Peek Traffic Systems Inc., 3000 Commonwealth Blvd., Tallahassee, FL 32303. F. Mueller is with the Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, D-10099 Berlin, Germany. M. G. Harmon is with the Department of Computer and Information Systems, Florida A & M University, Tallahassee, FL 32307-3101. The authors can be contacted at either [{whalley,healy}@cs.fsu.edu, (850) 644-3506, fax: -0058], [rarnold@transyt.peek-traffic.com, (850) 562-2253, ext. 272, fax: -4126], [mueller@informatik.hu-berlin.de, (+49) (30) 20181-276, fax: -280] or [harmon@cis.famu.edu, (850) 599-3042, fax: -3221].

on analyzing the execution performance of a sequence of instructions within a basic block, the analysis of pipeline performance across basic blocks is more problematic. Instruction or data cache misses further complicate the performance prediction problem since they require several more cycles to resolve than cache hits. Predicting the caching behavior of an instruction is even more difficult since it may be affected by memory references that occurred long before the instruction was executed.
The timing analysis of these features is further exacerbated since pipelining and caching behavior are not independent. For instance, consider the code segment and pipeline diagram in Figure 1, consisting of three SPARC instructions. The pipeline cycles and stages represent the execution of these instructions on a MicroSPARC I processor [1]. Each number within the pipeline diagram denotes that the specified instruction is currently in the pipeline stage shown on the left and is in that stage during the cycle indicated above. The first instruction performs a floating-point addition and requires a total of 20 cycles. Fetching the second instruction results in a cache miss, which is assumed to have a miss penalty of nine additional cycles in this paper. The third instruction has a data dependency with the first instruction and the execution of its MEM stage is delayed until the floating-point addition is completed.¹
inst 1: faddd %f2,%f0,%f2
inst 2: sub %o4,%g1,%i2
inst 3: std %f2,[%o0+8]
SPARC Instructions
[Pipeline Diagram: the three instructions proceed through the IF, ID, EX, MEM, WB, FEX, and FWB stages over cycles 1 through 22; instruction 1's lengthy floating-point execution overlaps the cache miss incurred when fetching instruction 2, and instruction 3 completes in cycle 22.]
Figure 1. Example of Overlapping Pipeline Stages with a Cache Miss

The miss penalty associated with the access to main memory to fetch the second instruction is completely overlapped with the execution of the floating-point addition in the first instruction. If pipeline stalls and cache misses were treated independently, then the number of estimated cycles associated with these instructions would be increased from 22 to 31 (i.e., by the nine-cycle cache miss penalty).
Unfortunately, the problem of overestimating WCET and underestimating BCET may become more severe in the future. Cache miss penalties are increasing due to the growing gap between processor and main memory speeds. Delays due to pipeline stalls become more likely with the introduction of superscalar and superpipelined architectures. Thus, naive timing analysis of programs on machines with pipelines and caches will result in increased execution time prediction errors.
Let us define a task as the portion of code executed between two scheduling points (context switches) in a system with a non-preemptive scheduling paradigm. When a task starts execution, the cache memory is assumed to be invalidated. During task execution, instructions are brought into cache and often result in many hits and misses that can be predicted statically. These caching predictions can be integrated with pipeline analysis to estimate tight WCET and BCET bounds.
Figure 2 depicts an overview of the approach described in this paper for bounding the worst and best-case performance of large code segments on machines with pipelines and instruction caches. Control-flow information, which could have been obtained by analyzing assembly or object files, is stored as a side effect of the compilation. This information identifies the loops that are in each function, the basic blocks that comprise each loop, the instructions that reside in each basic block, and the register operands associated with each instruction. The control-flow information is passed to a static cache simulator.
¹ A std instruction has no write back stage since a store instruction only updates memory and not a register. The std instruction also requires three cycles to complete the MEM stage on the MicroSPARC I.

[Figure 2 diagram: C Source Files feed the Compiler, which emits Control Flow Information; the Static Cache Simulator combines this information with a Cache Configuration to produce Instruction Caching Categorizations; the Timing Analyzer reads the categorizations, the control-flow information, and Machine Dependent Information, and answers User Timing Requests with Timing Predictions through the User Interface.]
Figure 2. Overview of Bounding Pipelining and Instruction Caching Performance.
The static cache simulator constructs the control-flow graph of the program, which consists of the call graph and the control flow of each function. The program's control-flow graph is then analyzed for a given cache configuration to produce a categorization of each instruction's potential caching behavior. The timing analyzer uses these categorizations to determine whether an instruction fetch should be treated as a hit or a miss during the pipeline analysis. It also reads machine-dependent and control-flow information to determine how each instruction proceeds through the pipeline. The timing analyzer produces a worst and best-case estimate of execution time for each loop and function within the program. Finally, a window-based interface is used to allow the user to request the timing bounds for portions of the program.
2. Instruction Caching Categorization
Static cache simulation² is used to statically categorize each instruction according to its caching behavior using a specific cache configuration in a given program. The static simulation consists of three phases. First, the control-flow graph of the entire program is constructed. This graph includes the control-flow information of each function and a function instance graph, which is simply a call graph where each function instance is uniquely identified by the sequence of call sites required for its invocation. Thus, a directed acyclic call graph (without recursion) is transformed into a tree of function instances.
² Static cache simulation is only briefly introduced in this section. It is described in more detail elsewhere [2, 3, 4, 5, 6].

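As a concrete illustration, a function instance node might be represented as sketched below, assuming a first-child/next-sibling tree; the field names and layout are hypothetical, not the paper's actual data structure.

    /* Hypothetical node in a function instance tree: a directed acyclic
     * call graph with no recursion unfolds into a tree, with one instance
     * per distinct sequence of call sites. */
    struct func_instance {
        const char           *func_name;  /* function this instance is of     */
        int                   call_site;  /* call site in the parent instance */
        struct func_instance *parent;     /* caller instance (NULL for main)  */
        struct func_instance *child;      /* first callee instance            */
        struct func_instance *sibling;    /* next instance called by parent   */
    };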
Next, this program control-flow graph is analyzed to determine the program lines that may be in cache at the entry and exit of each basic block within the program. The iterative algorithm in Figure 3 is used to calculate an input and output cache state for each basic block in the function instance graph. A cache state is simply the subset of all program lines that can potentially be cached at that point in the control flow. Initially, the input state of the top block (the entry block of the main function) is set to all invalid lines. The input state of a block is calculated by taking the union of the output states of its immediate predecessors. The output state of a block is calculated by taking the union of its input state and the program lines accessed by the block and subtracting the program lines with which the block conflicts. The above steps are repeated until no more changes occur.
input_state(top) = all invalid lines
WHILE any change DO
    FOR each basic block instance B DO
        input_state(B) = NULL
        FOR each immed pred P of B DO
            input_state(B) += output_state(P)
        output_state(B) = (input_state(B) + prog_lines(B)) - conf_lines(B)
Figure 3. Algorithm to Calculate Cache States.
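A minimal C sketch of this fixpoint computation follows, assuming cache states are encoded as 64-bit sets of program lines (so at most 64 lines in this toy version) and that block 0 is the entry block of main; the data layout is illustrative, not the paper's implementation.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define MAX_PREDS 8

    struct block {
        int      preds[MAX_PREDS];  /* indices of immediate predecessors       */
        int      num_preds;
        uint64_t prog_lines;        /* program lines accessed by this block    */
        uint64_t conf_lines;        /* program lines this block conflicts with */
    };

    /* Iterate to a fixed point over all basic block instances: the input
     * state is the union of the predecessors' output states, and the output
     * state is (input + lines accessed) - lines the block conflicts with.
     * "All invalid lines" for the top block is modeled as the empty set. */
    void calc_cache_states(const struct block *blocks, int n,
                           uint64_t *input, uint64_t *output)
    {
        memset(input, 0, n * sizeof *input);   /* top block starts invalid */
        memset(output, 0, n * sizeof *output);

        bool changed = true;
        while (changed) {
            changed = false;
            for (int b = 0; b < n; b++) {
                uint64_t in = 0;
                for (int p = 0; p < blocks[b].num_preds; p++)
                    in |= output[blocks[b].preds[p]];
                uint64_t out = (in | blocks[b].prog_lines)
                               & ~blocks[b].conf_lines;
                if (in != input[b] || out != output[b]) {
                    input[b] = in;
                    output[b] = out;
                    changed = true;
                }
            }
        }
    }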
The input state for each basic block is used to categorize the caching behavior of each instruction within the block. The categorization for each loop level is obtained by examining the cache state for that instruction with a mask representing the program lines that are accessed by the loop. An instruction's caching behavior is assigned to one of four categories for each loop level in which an instruction is contained. Note that each function is treated as a loop that executes for a single iteration.
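The four per-loop-level categories are not named in the text above; in this line of work they are typically always hit, always miss, first miss, and first hit (only first miss appears later on this page, so the remaining names are an assumption). A sketch:

    /* Per-loop-level caching categories, assuming the usual set from this
     * line of work; the names beyond FIRST_MISS are an assumption. */
    enum caching_category {
        ALWAYS_HIT,   /* the fetch is guaranteed to be in cache                */
        ALWAYS_MISS,  /* the fetch is never guaranteed to be in cache          */
        FIRST_MISS,   /* misses on the first loop iteration, hits afterwards   */
        FIRST_HIT     /* hits on the first loop iteration, may miss afterwards */
    };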

Citations
Book

Embedded System Design

TL;DR: Embedded System Design can be used as a textbook for courses on embedded systems and as a source that provides pointers to relevant material in the area for PhD students and teachers.
Journal ArticleDOI

The influence of processor architecture on the design and the results of WCET tools

TL;DR: The designs of WCET tools for a series of increasingly complex processors, including SuperSPARC, Motorola ColdFire 5307, and Motorola PowerPC 755, are described, and some advice is given as to the predictability of processor architectures.
Journal ArticleDOI

Guest Editorial: A Review of Worst-Case Execution-Time Analysis

TL;DR: The goal of this special issue is to review the achievements in WCET analysis and to report about the recent advances in this field.
Book

Real-Time Systems: Scheduling, Analysis, and Verification

TL;DR: Real-Time Systems: Scheduling, Analysis, and Verification provides a substantial, up-to-date overview of the verification and validation process of real-time systems.
References
Book

Compilers: Principles, Techniques, and Tools

TL;DR: This book discusses the design of a Code Generator, the role of the Lexical Analyzer, and other topics related to code generation and optimization.
Journal ArticleDOI

Calculating the maximum execution time of real-time programs

TL;DR: The problems for the calculation of the maximum execution time (MAXT... MAximum eXecution Time) are discussed and the preconditions which have to be met before the MAXT of a task can be calculated are shown.
Journal ArticleDOI

Reasoning about time in higher-level language software

TL;DR: A methodology for specifying and providing assertions about time in higher-level-language programs is described, and examples of timing bounds and assertions that are proved include deadlines, timing invariants for periodic processes, and the specification of time-based events such as those needed for the recognition of single and double clicks from a mouse button.
Journal ArticleDOI

A case for direct-mapped caches

TL;DR: Direct-mapped caches are defined, and it is shown that trends toward larger cache sizes and faster hit times favor their use.
Journal ArticleDOI

Predicting program execution times by analyzing static and dynamic program paths

TL;DR: A formal path model for dynamic path analysis is introduced, where user execution information is represented by a set of program paths and a method to verify given user information with known program verification techniques is introduced.
Frequently Asked Questions (8)
Q1. What have the authors contributed in "Bounding pipeline and instruction cache performance"?

This paper describes an approach for bounding the worst and best-case performance of large code segments on machines that exploit both pipelining and instruction caching. 

The authors plan to automate the detection of many data dependencies using existing compiler optimization techniques to obtain tighter performance estimations [23]. The authors also plan to accurately calculate the number of iterations for loops which are dependent on the value of a loop counter variable of an outer loop. Due to the analysis of a function instance tree (no recursion allowed), addresses of run-time stack references can be statically determined even when the addresses may differ for different invocations of the same function. Compiler flow analysis can be used to detect the pattern of many calculated references, such as indexing through an array.

Because of this dual caching behavior of a first miss instruction, it is necessary to perform more than one pipeline analysis of a path since the caching behavior of the instructions comprising the path can change between iterations. 

Since the pipeline effects of each of the paths within the loop are unioned, it only remains to be shown that the caching effects are treated properly. 

The list of memory blocks known to be in cache after executing the other construct is used to adjust the time of the current construct by comparing this list to the list of first reference blocks in the current construct.

If an instruction is categorized as a first miss, then the timing analyzer will treat the instruction fetch as a miss if the program line has not yet been encountered as a first miss in the timing of the loop. 
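A minimal sketch of just this first-miss rule, reusing the hypothetical caching_category enum from Section 2 above; the per-line seen flag (reset whenever the timing of a loop begins) is an assumption about one way to implement the bookkeeping, not the paper's code.

    #include <stdbool.h>

    /* Returns true if a FIRST_MISS fetch should be timed as a miss: only
     * the first time its program line is encountered while timing the
     * current loop.  seen[] holds one flag per program line and is cleared
     * before the loop's timing analysis starts. */
    bool first_miss_fetch_is_miss(int prog_line, bool seen[])
    {
        if (!seen[prog_line]) {
            seen[prog_line] = true;  /* later encounters are treated as hits */
            return true;
        }
        return false;
    }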

The timing analyzer will compute a BCET of 13 + 9*(n − 1) cycles for this loop, where n is the minimum number of loop iterations.

The Lim method would rarely detect instruction fetches that would always be misses until the surrounding constructs are analyzed, which is after the pipeline analysis of a construct has already occurred.