
Branch effect reduction techniques

Augustus K. Uht (University of Rhode Island), Vijay Sindagi (Texas Instruments), and Sajee Somanathan (ADE Corp.)
Computer, 01 May 1997, Vol. 30, Iss. 5, pp. 71-81
Abstract
Branch effects are the biggest obstacle to gaining significant speedups when running general-purpose code on instruction-level parallel machines. The article presents a survey which compares current branch effect reduction techniques, offering hope for greater gains. We believe this survey is timely because research is bearing much fruit: speedups of 10 or more are being demonstrated in research simulations and may be realized in hardware within a few years. The hardware required for large-scale exploitation is great, but the density of transistors per chip is increasing exponentially, with estimates of 50 to 100 million transistors per chip by the year 2000.

Branch Effect Reduction Techniques

There is an insatiable demand for computers of ever-increasing performance. Old applications are applied to more complex data and new applications demand improved capabilities. Developers must exploit parallelism for all types of programs to realize gains. Multiprocessor, multithreaded, vector, and dataflow computers achieve speedups up to the 1,000s for programs with large amounts of data parallelism or independent control flow. For general-purpose code, however, which comprises most executed code, parallel execution has been only two or three times faster than sequential.
General-purpose code has many conditional branches, irregular control flow, and much less data parallelism. These code characteristics and their detrimental consequences, in the form of branch effects, have severely limited the parallelism that can be exploited. Branch effects result from the uncertainties in the way branches execute.
In this article, we survey techniques to reduce branch effects and describe their relative merits, including examples from commercial machines. We believe this survey is timely because research is bearing much fruit: speedups of 10 or more are being demonstrated in research simulations and may be realized in hardware within a few years. The hardware required for large-scale exploitation is great, but the density of transistors per chip is increasing exponentially, with estimates of 50 to 100 million transistors per chip by 2000.
PERFORMANCE FACTORS
Architectural enhancements alone account for half the increase in processor performance over the years, a percentage that is expected to stay the same, if not grow. However, within the past five years, single-instruction-issue pipelined processors have topped out their performance, executing slightly less than one instruction per cycle. If designers are to continue increasing processor performance, they must turn to methods that exploit instruction-level parallelism within each program. Superscalar processors like the Intel Pentium and Motorola 68060 have been doing exactly this. However, performance has stalled at two to three instructions per cycle, on average. This stagnation is due to branch effects.
Branch effects

To illustrate how branch effects can block the exploitation of instruction-level parallelism, consider the typical program, which has two kinds of instructions: assignments (A=B+C) and branches. Branches are used to realize high-level control flow statements such as

    if (a<=b) {....}

or

    for (i=1; i<=10; i++) {....}

In many cases, nominally sequential instructions, such as A=B+C and D=E+F, are independent and thus may be executed in parallel. The performance improvement or speedup due to this parallelism is the time to execute a program sequentially divided by the time to execute the program in parallel. In a program composed of the two independent instructions just given, the speedup is 2 (2/1).
Branches give rise to control dependencies, a type of branch effect. Classically, if some condition is true, control transfers to the instruction at the branch's target address. The branch is then "taken," and its sign becomes T or 1. If the condition is false, execution continues with the instruction immediately after the branch, in which case the sign is N ("not taken") or 0. The computer cannot execute the code after a branch until it executes the branch and updates the program counter. With this restriction, parallelism can be exploited only from the instructions occurring up to the next branch. Because a branch path (code between executed branches) is typically three to nine instructions, and because data dependencies also restrict parallelism, the speedup is only about 1.6.[1,2] The sidebar "How Dependencies Limit Instruction-Level Parallelism" describes both data and control dependencies.
On the face of it, then, designers are stuck: they cannot create processors that execute more than one or two machine instructions per cycle. However, if branch effects could be completely eliminated, performance could improve 25 to 158 times over that with sequential execution.[1,3]
Branch effect reduction

Branch effect reduction techniques, or BERTs, attempt to free instruction-level parallelism using the mechanisms listed in Table 1, which also lists the techniques we describe here. As the table shows, a technique can use more than one mechanism. Most work has gone into speculative execution techniques, and they are consequently more common in commercial machines.
Speculative execution. This mechanism conditionally executes code after a branch, even if the code is dependent on the branch. Execution is speculative because code is executed before the processor knows it should be executed.

Branch predictors, which attempt to predict the branch sign, are key to most forms of speculative execution. The path predicted to be followed is the predicted path; the path predicted not to be followed is the not-predicted path. The predicted path can be of either branch sign (not taken or taken). A technique commonly predicts the branch path after the code being executed enters the processor's execution window but before the branch has resolved (before the sign is actually known).
Most speculative execution methods are single-path because they execute down one path from a branch. When the processor encounters a branch, the technique predicts the branch sign, and execution proceeds down the predicted path. However, because the branch is unresolved, the processor performs all writes to registers or memory and all I/O operations conditionally, finalizing them only when it is certain that all previously speculated branches have been predicted correctly. If there is a misprediction before a conditional operation, that operation is discarded. Hence, the greater the distance between mispredictions, the more parallelism can be extracted.
The accuracy of a technique's prediction is expressed as its branch prediction accuracy, the average fraction of correct predictions. The amount of instruction-level parallelism a reduction technique can realize is extremely sensitive to its branch prediction accuracy. For example, improving branch prediction accuracy from just 85 percent to 90 percent increases the distance between mispredictions by 50 percent, as given by

    distance ∝ 1 / (1 − accuracy)
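To make this sensitivity concrete, the following C fragment (our illustration, not from the article) evaluates the formula for a few accuracies:

    #include <stdio.h>

    /* distance = 1 / (1 - accuracy): expected number of branch paths
       between mispredictions. Going from 0.85 to 0.90 accuracy
       lengthens the run from about 6.7 to 10 paths, a 50 percent gain. */
    int main(void) {
        const double accuracies[] = { 0.85, 0.90, 0.93 };
        for (int i = 0; i < 3; i++) {
            double a = accuracies[i];
            printf("accuracy %.2f -> distance %.1f paths\n",
                   a, 1.0 / (1.0 - a));
        }
        return 0;
    }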
How Dependencies Limit Instruction-Level Parallelism

Two instructions must be executed sequentially if there are dependencies between them. A resource dependency arises if there are insufficient resources, such as adders, to execute all possible pending instructions simultaneously. Semantic dependencies require instructions to execute sequentially to ensure correct program results. Within this class are data and control (or procedural) dependencies. Both consist of a set of classical dependency types that restrict the available instruction-level parallelism. By determining a minimal set of these dependencies (a set that contains only true dependencies), more parallelism can be made available.
Table A shows classical data dependencies. In each case, the common use of memory or register variable A in instructions 1 and 2 creates the corresponding type of dependency. The set of minimal data dependencies is composed of flow or true data dependencies only. The other two types of data dependencies can be eliminated with renaming. In renaming, multiple copies of instruction sinks, such as A, are created. We assume that renaming is used throughout this article.

Recent research is exploring the possibility of reducing the effects of even true data dependencies using data prediction and speculation. Results are still inconclusive, however.
Classically, all instructions after a branch must wait for the branch to execute before they can execute. In the following example, instructions 2 through 7 are control dependent on instruction 1, a branch.

    1: if (a == b) {     // [in branch format: if (a != b) goto 3;]
    2:   z = y + x; }
    3: d = e * f;
    4: g = d * h;
    5: if (x == y) {     // [or: if (x != y) goto 7;]
    6:   u = y + e; }
    7: j = k * m;
With minimal control dependencies, the execution of instructions 3 through 7 does not depend on whether instruction 1 is taken. Because instructions 3 through 7, including the branch at 5, can execute concurrently with instruction 1, more parallelism is realized.
Table A. Classic data dependencies.

    Dependency name   Alternate (hazard) name   Example
    Flow or true      (read after write)        1. A = b + c    2. z = A * y
    Anti-             (write after read)        1. z = A + c    2. A = y * x
    Output            (write after write)       1. A = b + c    2. A = z * y
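To illustrate renaming in C terms (a sketch, with function and variable names of our own choosing), the output dependency in the last row disappears once each write is given its own sink:

    /* Before renaming: the two writes to A form an output (write after
       write) dependency and must execute in order. */
    int before(int b, int c, int z, int y) {
        int A;
        A = b + c;       /* write 1 */
        A = z * y;       /* write 2 must wait for write 1 */
        return A;
    }

    /* After renaming: each write gets its own copy of the sink, so the
       two assignments are independent and may execute in parallel. */
    int after(int b, int c, int z, int y) {
        int A1 = b + c;
        int A2 = z * y;
        return A2;       /* later readers of "A" are redirected to A2 */
    }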

Another important concept is the branch target buffer, a form of cache commonly used to handle branches through hardware. Typically, before a processor can execute a branch as taken, it must compute the branch's target address. This computation slows down the branch's execution, but the target address is saved in the branch target buffer. When the branch is executed again, the availability of the target address eliminates the time penalty that would occur otherwise. The buffer can also hold miscellaneous branch prediction information, such as the predictor's state.
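In C terms, a branch target buffer can be sketched as a small direct-mapped cache. The sizing, hashing, and field layout below are our own simplifying assumptions, not a description of any particular machine:

    #include <stdint.h>

    #define BTB_ENTRIES 1024          /* assumed buffer size */

    typedef struct {
        uint32_t branch_addr;         /* tag: address of the branch */
        uint32_t target_addr;         /* saved target address */
        uint8_t  valid;
        uint8_t  pred_state;          /* room for predictor state */
    } BTBEntry;

    static BTBEntry btb[BTB_ENTRIES];

    /* On a hit, the saved target is available immediately, avoiding
       the target-address computation penalty. */
    int btb_lookup(uint32_t branch_addr, uint32_t *target) {
        BTBEntry *e = &btb[branch_addr % BTB_ENTRIES];
        if (e->valid && e->branch_addr == branch_addr) {
            *target = e->target_addr;
            return 1;                 /* hit */
        }
        return 0;                     /* miss */
    }

    /* Called when a taken branch resolves and its target is known. */
    void btb_update(uint32_t branch_addr, uint32_t target) {
        BTBEntry *e = &btb[branch_addr % BTB_ENTRIES];
        e->branch_addr = branch_addr;
        e->target_addr = target;
        e->valid = 1;
    }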
Branch range reduction. This mechanism has two approaches. One is to use the set of minimal control dependencies. As the sidebar "How Dependencies Limit Instruction-Level Parallelism" describes, the classical model of control dependencies that all commercial and most research processors use treats all dependencies as true instead of recognizing the minimal set that are actually true. This classical treatment is relatively inexpensive but misses significant potential performance gains.[3]

Another form of this mechanism is predication, in which some assignment statements are executed only if another input to the statement, a predicate, is true.[4]
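Predication amounts to if-conversion. In the C sketch below (our own example; the conditional expression stands in for hardware predicated instructions), the control dependency becomes a data dependency on the predicate p:

    /* Branch form: the assignment to z is control dependent on the test. */
    int with_branch(int a, int b, int x, int y) {
        int z;
        if (a <= b)
            z = x + y;
        else
            z = x - y;
        return z;
    }

    /* Predicated form: no branch to mispredict; the predicate p selects
       which result takes effect. */
    int predicated(int a, int b, int x, int y) {
        int p = (a <= b);                 /* the predicate */
        int z = p ? (x + y) : (x - y);
        return z;
    }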
Block size increase. This mechanism increases the distance between branches, thus increasing the size of the average basic block and increasing the amount of code available for parallelism. Techniques include compiler-based methods, such as code percolation or motion, or trace scheduling.[5]
SPECULATIVE EXECUTION

Speculative execution can be realized in hardware or software and can be used among processors as well as within them. Although speculative execution most often refers to single-path, eager execution and the more recent disjoint eager execution (DEE) are also possible.[6] Figure 1 illustrates their differences.
Typically one or two processing elements are needed to execute the code in a branch path as concurrently as possible. In the single-path strategy, these resources are assigned linearly according to the number of branches pending. This strategy lowers hardware cost, but the usefulness of increasing predictions becomes negligible quite rapidly. The overall likelihood or cumulative probability of execution of the last branch path (at the tail of the tree) goes to zero, making the added resources useless.

With the eager execution model, execution proceeds down both paths of a branch, and no prediction is made. When a branch resolves, all operations on the not-taken paths are discarded. Consequently, eager execution with unlimited resources (oracle execution) would give the best performance, but it is hardly practical. With constrained resources, the eager execution strategy does not perform very well.[1] Also, hardware cost rises exponentially with each level of branches, and it is hard to keep track of different sets of operations. For these reasons, the eager execution strategy is seldom used, except for limited applications, such as instruction fetch and decode in the Sun SuperSparc and IBM 360/91.
Table 1. BERT mechanisms and implementations.

    Technique                                          Commercial implementation examples
    Speculative execution
      Eager execution                                  IBM 360/91, Sun SuperSparc
      Disjoint eager execution
        alone
        with minimal control dependencies (MCD)
      Single path
        No branch prediction                           Intel 8086
        Static
          Always not taken                             Intel i486
          Always taken                                 Sun SuperSparc
          Backward Taken, Forward Not Taken (BTFN)     HP PA-7x00
          Semistatic (profiling)                       Early PowerPCs
        Dynamic
          1-bit                                        DEC Alpha 21064, AMD-K5
          2-bit                                        NexGen 586, PowerPC 604, Cyrix 6x86,
                                                       Cyrix M2, MIPS R10000
          Two-level adaptive                           Intel Pentium Pro, AMD-K6
          Selector                                     DEC Alpha 21264
        Hybrid
      Multiscalar
    Other BERTs (branch range reduction, block size increase)
      Minimal control dependencies
      Predication alone                                Denelcor HEP
      Predication with software                        Cydrome Cydra 5, Intel Pentium Pro
      VLIW                                             Multiflow Trace, Cydrome Cydra 5,
                                                       Intel/HP Merced (?)
The disjoint eager execution strategy performs better than the other two strategies when resources are limited. The idea is to assign resources to branch paths whose results are most likely to be used; that is, branch paths with the highest cumulative probabilities of execution. Thus, all branches are predicted, and some are eagerly executed. The hardware cost is close to that of single-path, but performance is much better. As the sidebar "Disjoint Eager Execution: A Simulation Experiment" describes, speedups of 32 are possible. Many instantiations of this strategy provide variations in cost-accuracy trade-offs; we describe one implementation in the sidebar.

[Figure 1. Speculative execution strategies: (a) single path, (b) eager execution, (c) disjoint eager execution. Each line segment with an arrow represents a branch path. Resources are fixed at six branch paths. Bold lines indicate the code in the execution window; resources are assigned only to bold lines (paths). All branches are pending (unresolved). Left-pointing lines are predicted paths. Right-pointing lines are not-predicted paths. Circled numbers indicate the order of the resource assignment. Uncircled numbers indicate the cumulative probability that the path will be executed. For illustration, branch prediction accuracy is 70 percent for all branches. The disjoint eager execution strategy allocates resources to more likely paths than the other strategies.]
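The greedy assignment rule behind DEE can be sketched in C. We assume a uniform 70 percent prediction accuracy and a budget of six paths, matching Figure 1; a real design would use a priority structure rather than a linear scan:

    #include <stdio.h>

    #define BUDGET  6                  /* resources: six branch paths */
    #define MAXCAND 64

    int main(void) {
        const double P = 0.7;          /* assumed prediction accuracy */
        double cand[MAXCAND];          /* cumulative execution probabilities */
        int ncand = 0;

        cand[ncand++] = P;             /* predicted path of the first branch */
        cand[ncand++] = 1.0 - P;       /* its not-predicted alternative */

        for (int n = 1; n <= BUDGET; n++) {
            int best = 0;              /* pick the most probable candidate */
            for (int i = 1; i < ncand; i++)
                if (cand[i] > cand[best]) best = i;

            double q = cand[best];
            printf("path %d: cumulative probability %.3f\n", n, q);

            /* the assigned path ends at a new pending branch, which
               exposes two further candidate paths */
            if (ncand + 2 <= MAXCAND) {
                cand[ncand++] = q * P;
                cand[ncand++] = q * (1.0 - P);
            }
            cand[best] = 0.0;          /* mark as assigned */
        }
        return 0;
    }

Run with these numbers, the six assignments come out at cumulative probabilities 0.70, 0.49, 0.34, 0.30, 0.24, and 0.21: mostly down the predicted path, with the not-predicted side of the first branch picked up fourth, as in Figure 1c.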
Most speculative execution uses some form of branch predictor. The latest ones are very accurate, but each improves branch prediction accuracy by less than a percent over its predecessors, an indication that branch prediction accuracy may be topping out. We describe the most common predictors here.
Static predictors

Static predictors operate by making hardwired predictions, typically that branches are executed as either all not taken (Intel i486) or all taken (Sun SuperSparc). These techniques cost practically nothing but have an accuracy of only 40 to 60 percent. More involved but still inexpensive methods also look at branch direction. BTFN (backward taken, forward not taken), for example, predicts that all backward branches are taken and all forward branches are not taken. Because backward branches are typically taken 90 percent of the time, BTFN improves branch prediction accuracy to 65 percent. The HP PA-7x00 processors use this strategy.
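The BTFN rule is cheap because it needs only the branch and target addresses, both known early. A minimal C sketch of the rule:

    #include <stdint.h>

    /* Backward taken, forward not taken: backward branches (target
       below the branch address) usually close loops and are taken. */
    int btfn_predict(uint32_t branch_addr, uint32_t target_addr) {
        return target_addr < branch_addr;   /* 1 = predict taken */
    }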
Semistatic predictors form a large class of static predictors. Again, predictions are constant over the program's execution. However, unlike other static predictors, semistatic predictors vary across static branches. And because the compiler makes these predictions, they are included in the machine instructions, which means that if designers port this method to an existing processor they must modify the processor's instruction set.

The compiler makes predictions using program profile statistics, which it obtains by compiling the program once and then running it on test data while counting the times a branch is taken versus the times it is not taken. The program is recompiled, using the statistics to set the prediction bits in the object code's branches accordingly.

These predictors are limited because the statistics, and hence predictions, can vary from the test data to the actual data. Allowing predictions to vary from branch to branch improves the prediction accuracies of forward branches primarily; a typical forward branch executes predominantly with one sign. Therefore the branch prediction accuracy improves to, on average, 75 percent. Many PowerPC processors use semistatic prediction.
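The recompilation step reduces, per static branch, to a majority vote over the profile counts. A sketch, with type and field names of our own:

    /* Profile statistics gathered on the test-data run. */
    typedef struct {
        long taken;        /* times the branch was taken */
        long not_taken;    /* times it was not taken */
    } BranchProfile;

    /* The compiler sets this bit in the branch instruction; the
       prediction then stays fixed for the program's lifetime. */
    int semistatic_prediction_bit(const BranchProfile *p) {
        return p->taken > p->not_taken;   /* 1 = predict taken */
    }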
Dynamic predictors

In dynamic prediction, predictions adapt to the input data. A branch may execute consistently one way in one part of the execution and the other way in another part. A dynamic predictor can adapt to the change and continue to make accurate predictions; a semistatic predictor in a similar situation would give wrong predictions much of the time. No profiling is needed; dynamic prediction can be accomplished entirely in hardware.

Dynamic predictors are typically 1-bit or 2-bit, so named because of the storage needed to implement them. The two-level adaptive predictor, a more recent type, greatly increases the branch prediction accuracy of the 2-bit predictor. The selector predictor allows multiple predictors to be used together.
1-bit predictors. Figure 2a shows how a 1-bit prediction algorithm uses state to predict that a branch will next execute the same way. Nominally, there is a separate automaton (state machine) for each static branch.

The state of the automaton becomes 1 if a branch is actually taken and 0 if it is not. The new state indicates the prediction for the next instance of the branch.

The automaton can be realized implicitly with a branch target buffer. If the buffer contains an entry for the branch, the branch was taken when last executed, and the dynamic prediction algorithm predicts that the same branch will be taken when next encountered. If there is no entry in the buffer, the branch was not taken when last executed, and the algorithm predicts it will be not taken again.

One-bit predictors have a branch prediction accuracy of 77 to 79 percent. The DEC Alpha 21064 processor uses this predictor, holding the state for up to 2K automata.
2-bit predictors. Figure 2b shows the 2-bit saturating up/down counter developed by James Smith.[7] Performance is better (78 to 89 percent accuracies on real machines), but the cost is higher.

Each 2-bit automaton's state is stored in a branch target buffer. A branch is predicted by reading the buffer and using the state of the automaton. Branches that are more often taken are predicted taken; likewise for not-taken branches. In this way, the predictions are based on averaging.

The 2-bit predictor is less affected by occasional changes in branch sign than the 1-bit predictor. In the branch execution stream N-N-N-T-N-N-N, the 1-bit predictor gives two mispredictions; the 2-bit predictor, only one. However, the 2-bit predictor can potentially be wrong 100 percent of the time (if starting from state 01, every branch in T-N-T-N-T-N... would be mispredicted).

Recent microprocessors, such as the NexGen 586 (2K automata) and the Intel Pentium (256 automata), use this predictor.
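The 2-bit automaton in Figure 2b is a saturating counter whose most significant bit is the prediction. A C sketch under the same indexing assumption as before:

    #include <stdint.h>

    #define NPRED2 2048
    static uint8_t counter[NPRED2];       /* 2-bit states 0..3 (00..11) */

    /* States 10 and 11 predict taken; 00 and 01 predict not taken. */
    int predict_2bit(uint32_t branch_addr) {
        return counter[branch_addr % NPRED2] >= 2;
    }

    /* Saturating update: count up on taken, down on not taken, so a
       single contrary sign does not flip a strongly biased prediction. */
    void update_2bit(uint32_t branch_addr, int taken) {
        uint8_t *c = &counter[branch_addr % NPRED2];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }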
Two-level adaptive predictor. Researchers at the University of Michigan[8] and later IBM and the University of Texas[9] devised the two-level adaptive, or branch correlation, predictor, which is significantly more accurate (typically 93 percent accuracy) than the 1- or 2-bit predictors because the predictor bases its predictions on specific branch histories, not on a general averaging.

As Figure 3 shows, prediction involves two structures. The branch history register holds the branch execution history. Each time a dynamic instance of any branch resolves, its sign is shifted into the register. The register helps prediction by capturing much longer and more varied patterns of branch executions, relative to a 2-bit predictor. The branch pattern table contains a 2-bit counter automaton for each possible pattern of the branch history register. Typically, a processor uses one register and one table for all branches.

The automata are accessed using the contents of the branch history register as the table's address ("index" in the figure). As with the 1- and 2-bit predictors, the state of the indexed automaton indicates the prediction.

Using a single branch history register, the predictor combines information from multiple branches, allowing the correlation among different static branches to be exploited.
[Figure 2. Simple dynamic branch predictors, which predict whether a branch is taken by looking at the most significant bit of the predictor's state. This bit gives the sign of the branch: 1 is "predict T(aken)"; 0 is "predict N(ot taken)." A state transition occurs when a branch resolves, and is determined by that branch's sign. (a) 1-bit branch predictor: states 0 (predict N) and 1 (predict T). (b) 2-bit predictor: states 00 and 01 predict N, states 10 and 11 predict T; the outer states are saturated, the inner states unsaturated.]
[Figure 3. Two-level adaptive branch predictor. Each row of the branch pattern table is the equivalent of the 2-bit counter in Figure 2b. The branch history register holds the signs of past branch executions; the sign of the latest resolved branch is shifted in. The predictor uses this recent history to index to a particular automaton in the branch pattern table.]
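The structure in Figure 3 translates directly into C. The 4-bit history length is our assumption for the sketch; the single global register and table follow the text:

    #include <stdint.h>

    #define HIST_BITS 4                       /* assumed history length */
    #define PATTERNS  (1 << HIST_BITS)

    static uint8_t history;                   /* branch history register */
    static uint8_t pattern_table[PATTERNS];   /* one 2-bit counter per pattern */

    /* Predict from the counter selected by the recent global history. */
    int predict_2level(void) {
        return pattern_table[history] >= 2;   /* msb of the 2-bit counter */
    }

    /* On resolution: train the indexed counter, then shift the sign of
       the latest resolved branch into the history register. */
    void update_2level(int taken) {
        uint8_t *c = &pattern_table[history];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
        history = (uint8_t)(((history << 1) | (taken != 0)) & (PATTERNS - 1));
    }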

References

- "Multiscalar processors," conference proceedings.
- "A study of branch prediction strategies," conference proceedings.
- D.W. Wall, Limits of Instruction-Level Parallelism, book.
- "A VLIW architecture for a trace scheduling compiler," journal article.
- "Two-level adaptive training branch prediction," conference proceedings.