
Branch effect reduction techniques

Augustus K. Uht (University of Rhode Island), Vijay Sindagi (Texas Instruments), and Sajee Somanathan (ADE Corp.)
Computer, 01 May 1997, Vol. 30, Iss. 5, pp. 71-81
Abstract
Branch effects are the biggest obstacle to gaining significant speedups when running general-purpose code on instruction-level parallel machines. The article presents a survey which compares current branch effect reduction techniques, offering hope for greater gains. We believe this survey is timely because research is bearing much fruit: speedups of 10 or more are being demonstrated in research simulations and may be realized in hardware within a few years. The hardware required for large-scale exploitation is great, but the density of transistors per chip is increasing exponentially, with estimates of 50 to 100 million transistors per chip by the year 2000.

Branch Effect Reduction Techniques

There is an insatiable demand for computers of ever-increasing performance. Old applications are applied to more complex data and new applications demand improved capabilities. Developers must exploit parallelism for all types of programs to realize gains. Multiprocessor, multithreaded, vector, and dataflow computers achieve speedups up to the 1,000s for programs with large amounts of data parallelism or independent control flow. For general-purpose code, however, which comprises most executed code, parallel execution has been only two or three times faster than sequential.
General-purpose code has many conditional branches, irregular control flow, and much less data parallelism. These code characteristics and their detrimental consequences, in the form of branch effects, have severely limited the parallelism that can be exploited. Branch effects result from the uncertainties in the way branches execute.
In this article, we survey techniques to reduce branch effects and describe their relative merits, including examples from commercial machines. We believe this survey is timely because research is bearing much fruit: speedups of 10 or more are being demonstrated in research simulations and may be realized in hardware within a few years. The hardware required for large-scale exploitation is great, but the density of transistors per chip is increasing exponentially, with estimates of 50 to 100 million transistors per chip by 2000.
PERFORMANCE FACTORS
Architectural enhancements alone account for half the increase in processor performance over the years, a percentage that is expected to stay the same, if not grow. However, within the past five years, single-instruction-issue pipelined processors have topped out their performance, executing slightly less than one instruction per cycle. If designers are to continue increasing processor performance, they must turn to methods that exploit instruction-level parallelism within each program. Superscalar processors like the Intel Pentium and Motorola 68060 have been doing exactly this. However, performance has stalled at two to three instructions per cycle, on average. This stagnation is due to branch effects.
Branch effects

To illustrate how branch effects can block the exploitation of instruction-level parallelism, consider the typical program, which has two kinds of instructions: assignments (A=B+C) and branches. Branches are used to realize high-level control flow statements such as

    if (a<=b) {....}

or

    for (i=1; i<=10; i++) {....}

In many cases, nominally sequential instructions, such as A=B+C and D=E+F, are independent and thus may be executed in parallel. The performance improvement or speedup due to this parallelism is the time to execute a program sequentially divided by the time to execute the program in parallel. In a program composed of the two independent instructions just given, the speedup is 2 (2/1).
Branches give rise to control dependencies, a type of branch effect. Classically, if some condition is true, control transfers to the instruction at the branch's target address. The branch is then "taken," and its sign becomes T or 1. If the condition is false, execution continues with the instruction immediately after the branch, in which case the sign is N ("not taken") or 0. The computer cannot execute the code after a branch until it executes the branch and updates the program counter. With this restriction, parallelism can be exploited only from the instructions occurring up to the next branch. Because a branch path (code between executed branches) is typically three to nine instructions, and because data dependencies also restrict parallelism, the speedup is only about 1.6.[1,2] The sidebar "How Dependencies Limit Instruction-Level Parallelism" describes both data and control dependencies.
On the face of it, then, designers are stuck: they cannot create processors that execute more than one or two machine instructions per cycle. However, if branch effects could be completely eliminated, performance could improve 25 to 158 times over that with sequential execution.[1,3]
Branch effect reduction

Branch effect reduction techniques, or BERTs, attempt to free instruction-level parallelism using the mechanisms listed in Table 1, which also lists the techniques we describe here. As the table shows, a technique can use more than one mechanism. Most work has gone into speculative execution techniques, and they are consequently more common in commercial machines.
Speculative execution. This mechanism conditionally executes code after a branch, even if the code is dependent on the branch. Execution is speculative because code is executed before the processor knows it should be executed.

Branch predictors, which attempt to predict the branch sign, are key to most forms of speculative execution. The path predicted to be followed is the predicted path; the path predicted not to be followed is the not-predicted path. The predicted path can be of either branch sign (not taken or taken). A technique commonly predicts the branch path after the code being executed enters the processor's execution window but before the branch has resolved (before the sign is actually known).
Most speculative execution methods are single-path because they execute down one path from a branch. When the processor encounters a branch, the technique predicts the branch sign, and execution proceeds down the predicted path. However, because the branch is unresolved, the processor performs all writes to registers or memory and all I/O operations conditionally, finalizing them only when it is certain that all previously speculated branches have been predicted correctly. If there is a misprediction before a conditional operation, that operation is discarded. Hence, the greater the distance between mispredictions, the more parallelism can be extracted.
The accuracy of a technique's prediction is expressed as its branch prediction accuracy, the average fraction of correct predictions. The amount of instruction-level parallelism a reduction technique can realize is extremely sensitive to its branch prediction accuracy. For example, improving branch prediction accuracy from just 85 percent to 90 percent increases the distance between mispredictions by 50 percent, as given by

    distance ∝ 1 / (1 − accuracy)
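To make this sensitivity concrete, the following C fragment (our illustration, not from the article) evaluates the formula for a few accuracies:

    #include <stdio.h>

    /* distance = 1 / (1 - accuracy): expected number of branch paths
       between mispredictions. Going from 0.85 to 0.90 accuracy
       lengthens the run from about 6.7 to 10 paths, a 50 percent gain. */
    int main(void) {
        const double accuracies[] = { 0.85, 0.90, 0.93 };
        for (int i = 0; i < 3; i++) {
            double a = accuracies[i];
            printf("accuracy %.2f -> distance %.1f paths\n",
                   a, 1.0 / (1.0 - a));
        }
        return 0;
    }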
How Dependencies Limit Instruction-Level Parallelism

Two instructions must be executed sequentially if there are dependencies between them. A resource dependency arises if there are insufficient resources, such as adders, to execute all possible pending instructions simultaneously. Semantic dependencies require instructions to execute sequentially to ensure correct program results. Within this class are data and control (or procedural) dependencies. Both consist of a set of classical dependency types that restrict the available instruction-level parallelism. By determining a minimal set of these dependencies (a set that contains only true dependencies), more parallelism can be made available.
Table A shows classical data dependencies. In each case, the common use of memory or register variable A in instructions 1 and 2 creates the corresponding type of dependency. The set of minimal data dependencies is composed of flow or true data dependencies only. The other two types of data dependencies can be eliminated with renaming. In renaming, multiple copies of instruction sinks, such as A, are created. We assume that renaming is used throughout this article.

Recent research is exploring the possibility of reducing the effects of even true data dependencies using data prediction and speculation. Results are still inconclusive, however.
Classically, all instructions after a branch must wait for the branch to execute before they can execute. In the following example, instructions 2 through 7 are control dependent on instruction 1, a branch.

    1: if (a == b) {     // [in branch format: if (a != b) goto 3;]
    2:   z = y + x; }
    3: d = e * f;
    4: g = d * h;
    5: if (x == y) {     // [or: if (x != y) goto 7;]
    6:   u = y + e; }
    7: j = k * m;
With minimal control dependencies, the execution of instructions 3 through 7 does not depend on whether instruction 1 is taken. Because instructions 3 through 7, including the branch at 5, can execute concurrently with instruction 1, more parallelism is realized.
Table A. Classic data dependencies.

    Dependency name   Alternate (hazard) name   Example
    Flow or true      (read after write)        1. A = b + c    2. z = A * y
    Anti-             (write after read)        1. z = A + c    2. A = y * x
    Output            (write after write)       1. A = b + c    2. A = z * y
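To illustrate renaming in C terms (a sketch, with function and variable names of our own choosing), the output dependency in the last row disappears once each write is given its own sink:

    /* Before renaming: the two writes to A form an output (write after
       write) dependency and must execute in order. */
    int before(int b, int c, int z, int y) {
        int A;
        A = b + c;       /* write 1 */
        A = z * y;       /* write 2 must wait for write 1 */
        return A;
    }

    /* After renaming: each write gets its own copy of the sink, so the
       two assignments are independent and may execute in parallel. */
    int after(int b, int c, int z, int y) {
        int A1 = b + c;
        int A2 = z * y;
        return A2;       /* later readers of "A" are redirected to A2 */
    }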

Another important concept is the branch target buffer, a form of cache commonly used to handle branches through hardware. Typically, before a processor can execute a branch as taken, it must compute the branch's target address. This computation slows down the branch's execution, but the target address is saved in the branch target buffer. When the branch is executed again, the availability of the target address eliminates the time penalty that would occur otherwise. The buffer can also hold miscellaneous branch prediction information, such as the predictor's state.
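In C terms, a branch target buffer can be sketched as a small direct-mapped cache. The sizing, hashing, and field layout below are our own simplifying assumptions, not a description of any particular machine:

    #include <stdint.h>

    #define BTB_ENTRIES 1024          /* assumed buffer size */

    typedef struct {
        uint32_t branch_addr;         /* tag: address of the branch */
        uint32_t target_addr;         /* saved target address */
        uint8_t  valid;
        uint8_t  pred_state;          /* room for predictor state */
    } BTBEntry;

    static BTBEntry btb[BTB_ENTRIES];

    /* On a hit, the saved target is available immediately, avoiding
       the target-address computation penalty. */
    int btb_lookup(uint32_t branch_addr, uint32_t *target) {
        BTBEntry *e = &btb[branch_addr % BTB_ENTRIES];
        if (e->valid && e->branch_addr == branch_addr) {
            *target = e->target_addr;
            return 1;                 /* hit */
        }
        return 0;                     /* miss */
    }

    /* Called when a taken branch resolves and its target is known. */
    void btb_update(uint32_t branch_addr, uint32_t target) {
        BTBEntry *e = &btb[branch_addr % BTB_ENTRIES];
        e->branch_addr = branch_addr;
        e->target_addr = target;
        e->valid = 1;
    }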
Branch range reduction. This mechanism has two approaches. One is to use the set of minimal control dependencies. As the sidebar "How Dependencies Limit Instruction-Level Parallelism" describes, the classical model of control dependencies that all commercial and most research processors use treats all dependencies as true instead of recognizing the minimal set that are actually true. This classical treatment is relatively inexpensive but misses significant potential performance gains.[3]

Another form of this mechanism is predication, in which some assignment statements are executed only if another input to the statement, a predicate, is true.[4]
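Predication amounts to if-conversion. In the C sketch below (our own example; the conditional expression stands in for hardware predicated instructions), the control dependency becomes a data dependency on the predicate p:

    /* Branch form: the assignment to z is control dependent on the test. */
    int with_branch(int a, int b, int x, int y) {
        int z;
        if (a <= b)
            z = x + y;
        else
            z = x - y;
        return z;
    }

    /* Predicated form: no branch to mispredict; the predicate p selects
       which result takes effect. */
    int predicated(int a, int b, int x, int y) {
        int p = (a <= b);                 /* the predicate */
        int z = p ? (x + y) : (x - y);
        return z;
    }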
Block size increase. This mechanism increases the distance between branches, thus increasing the size of the average basic block and increasing the amount of code available for parallelism. Techniques include compiler-based methods, such as code percolation or motion, or trace scheduling.[5]
SPECULATIVE EXECUTION

Speculative execution can be realized in hardware or software and can be used among processors as well as within them. Although speculative execution most often refers to single-path, eager execution and the more recent disjoint eager execution (DEE) are also possible.[6] Figure 1 illustrates their differences.
Typically one or two processing elements are needed to execute the code in a branch path as concurrently as possible. In the single-path strategy, these resources are assigned linearly according to the number of branches pending. This strategy lowers hardware cost, but the usefulness of increasing predictions becomes negligible quite rapidly. The overall likelihood or cumulative probability of execution of the last branch path (at the tail of the tree) goes to zero, making the added resources useless.

With the eager execution model, execution proceeds down both paths of a branch, and no prediction is made. When a branch resolves, all operations on the not-taken paths are discarded. Consequently, eager execution with unlimited resources (oracle execution) would give the best performance, but it is hardly practical. With constrained resources, the eager execution strategy does not perform very well.[1] Also, hardware cost rises exponentially with each level of branches, and it is hard to keep track of different sets of operations. For these reasons, the eager execution strategy is seldom used, except for limited applications, such as instruction fetch and decode in the Sun SuperSparc and IBM 360/91.
Table 1. BERT mechanisms and implementations.

    Technique                                          Commercial implementation examples
    Speculative execution
      Eager execution                                  IBM 360/91, Sun SuperSparc
      Disjoint eager execution
        alone
        with minimal control dependencies (MCD)
      Single path
        No branch prediction                           Intel 8086
        Static
          Always not taken                             Intel i486
          Always taken                                 Sun SuperSparc
          Backward Taken, Forward Not Taken (BTFN)     HP PA-7x00
          Semistatic (profiling)                       Early PowerPCs
        Dynamic
          1-bit                                        DEC Alpha 21064, AMD-K5
          2-bit                                        NexGen 586, PowerPC 604, Cyrix 6x86,
                                                       Cyrix M2, MIPS R10000
          Two-level adaptive                           Intel Pentium Pro, AMD-K6
          Selector                                     DEC Alpha 21264
        Hybrid
      Multiscalar
    Other BERTs (branch range reduction, block size increase)
      Minimal control dependencies
      Predication alone                                Denelcor HEP
      Predication with software                        Cydrome Cydra 5, Intel Pentium Pro
      VLIW                                             Multiflow Trace, Cydrome Cydra 5,
                                                       Intel/HP Merced (?)
The disjoint eager execution strategy performs better than the other two strategies when resources are limited. The idea is to assign resources to branch paths whose results are most likely to be used; that is, branch paths with the highest cumulative probabilities of execution. Thus, all branches are predicted, and some are eagerly executed. The hardware cost is close to that of single-path, but performance is much better. As the sidebar "Disjoint Eager Execution: A Simulation Experiment" describes, speedups of 32 are possible. Many instantiations of this strategy provide variations in cost-accuracy trade-offs; we describe one implementation in the sidebar.

[Figure 1. Speculative execution strategies: (a) single path, (b) eager execution, (c) disjoint eager execution. Each line segment with an arrow represents a branch path. Resources are fixed at six branch paths. Bold lines indicate the code in the execution window; resources are assigned only to bold lines (paths). All branches are pending (unresolved). Left-pointing lines are predicted paths. Right-pointing lines are not-predicted paths. Circled numbers indicate the order of the resource assignment. Uncircled numbers indicate the cumulative probability that the path will be executed. For illustration, branch prediction accuracy is 70 percent for all branches. The disjoint eager execution strategy allocates resources to more likely paths than the other strategies.]
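The greedy assignment rule behind DEE can be sketched in C. We assume a uniform 70 percent prediction accuracy and a budget of six paths, matching Figure 1; a real design would use a priority structure rather than a linear scan:

    #include <stdio.h>

    #define BUDGET  6                  /* resources: six branch paths */
    #define MAXCAND 64

    int main(void) {
        const double P = 0.7;          /* assumed prediction accuracy */
        double cand[MAXCAND];          /* cumulative execution probabilities */
        int ncand = 0;

        cand[ncand++] = P;             /* predicted path of the first branch */
        cand[ncand++] = 1.0 - P;       /* its not-predicted alternative */

        for (int n = 1; n <= BUDGET; n++) {
            int best = 0;              /* pick the most probable candidate */
            for (int i = 1; i < ncand; i++)
                if (cand[i] > cand[best]) best = i;

            double q = cand[best];
            printf("path %d: cumulative probability %.3f\n", n, q);

            /* the assigned path ends at a new pending branch, which
               exposes two further candidate paths */
            if (ncand + 2 <= MAXCAND) {
                cand[ncand++] = q * P;
                cand[ncand++] = q * (1.0 - P);
            }
            cand[best] = 0.0;          /* mark as assigned */
        }
        return 0;
    }

Run with these numbers, the six assignments come out at cumulative probabilities 0.70, 0.49, 0.34, 0.30, 0.24, and 0.21: mostly down the predicted path, with the not-predicted side of the first branch picked up fourth, as in Figure 1c.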
Most speculative execution uses some form of branch predictor. The latest ones are very accurate, but each improves branch prediction accuracy by less than a percent over its predecessors, an indication that branch prediction accuracy may be topping out. We describe the most common predictors here.
Static predictors

Static predictors operate by making hardwired predictions, typically that branches are executed as either all not taken (Intel i486) or all taken (Sun SuperSparc). These techniques cost practically nothing but have an accuracy of only 40 to 60 percent. More involved but still inexpensive methods also look at branch direction. BTFN (backward taken, forward not taken), for example, predicts that all backward branches are taken and all forward branches are not taken. Because backward branches are typically taken 90 percent of the time, BTFN improves branch prediction accuracy to 65 percent. The HP PA-7x00 processors use this strategy.
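The BTFN rule is cheap because it needs only the branch and target addresses, both known early. A minimal C sketch of the rule:

    #include <stdint.h>

    /* Backward taken, forward not taken: backward branches (target
       below the branch address) usually close loops and are taken. */
    int btfn_predict(uint32_t branch_addr, uint32_t target_addr) {
        return target_addr < branch_addr;   /* 1 = predict taken */
    }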
Semistatic predictors form a large class of static predictors. Again, predictions are constant over the program's execution. However, unlike other static predictors, semistatic predictors vary across static branches. And because the compiler makes these predictions, they are included in the machine instructions, which means that if designers port this method to an existing processor they must modify the processor's instruction set.

The compiler makes predictions using program profile statistics, which it obtains by compiling the program once and then running it on test data while counting the times a branch is taken versus the times it is not taken. The program is recompiled, using the statistics to set the prediction bits in the object code's branches accordingly.

These predictors are limited because the statistics, and hence predictions, can vary from the test data to the actual data. Allowing predictions to vary from branch to branch improves the prediction accuracies of forward branches primarily; a typical forward branch executes predominantly with one sign. Therefore the branch prediction accuracy improves to, on average, 75 percent. Many PowerPC processors use semistatic prediction.
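The recompilation step reduces, per static branch, to a majority vote over the profile counts. A sketch, with type and field names of our own:

    /* Profile statistics gathered on the test-data run. */
    typedef struct {
        long taken;        /* times the branch was taken */
        long not_taken;    /* times it was not taken */
    } BranchProfile;

    /* The compiler sets this bit in the branch instruction; the
       prediction then stays fixed for the program's lifetime. */
    int semistatic_prediction_bit(const BranchProfile *p) {
        return p->taken > p->not_taken;   /* 1 = predict taken */
    }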
Dynamic predictors

In dynamic prediction, predictions adapt to the input data. A branch may execute consistently one way in one part of the execution and the other way in another part. A dynamic predictor can adapt to the change and continue to make accurate predictions; a semistatic predictor in a similar situation would give wrong predictions much of the time. No profiling is needed; dynamic prediction can be accomplished entirely in hardware.

Dynamic predictors are typically 1-bit or 2-bit, so named because of the storage needed to implement them. The two-level adaptive predictor, a more recent type, greatly increases the branch prediction accuracy of the 2-bit predictor. The selector predictor allows multiple predictors to be used together.
1-bit predictors. Figure 2a shows how a 1-bit prediction algorithm uses state to predict that a branch will next execute the same way. Nominally, there is a separate automaton (state machine) for each static branch.

The state of the automaton becomes 1 if a branch is actually taken and 0 if it is not. The new state indicates the prediction for the next instance of the branch.

The automaton can be realized implicitly with a branch target buffer. If the buffer contains an entry for the branch, the branch was taken when last executed, and the dynamic prediction algorithm predicts that the same branch will be taken when next encountered. If there is no entry in the buffer, the branch was not taken when last executed, and the algorithm predicts it will be not taken again.

One-bit predictors have a branch prediction accuracy of 77 to 79 percent. The DEC Alpha 21064 processor uses this predictor, holding the state for up to 2K automata.
2-bit predictors. Figure 2b shows the 2-bit saturating up/down counter developed by James Smith.[7] Performance is better (78 to 89 percent accuracies on real machines), but the cost is higher.

Each 2-bit automaton's state is stored in a branch target buffer. A branch is predicted by reading the buffer and using the state of the automaton. Branches that are more often taken are predicted taken; likewise for not-taken branches. In this way, the predictions are based on averaging.

The 2-bit predictor is less affected by occasional changes in branch sign than the 1-bit predictor. In the branch execution stream N-N-N-T-N-N-N, the 1-bit predictor gives two mispredictions; the 2-bit predictor, only one. However, the 2-bit predictor can potentially be wrong 100 percent of the time (if starting from state 01, every branch in T-N-T-N-T-N... would be mispredicted).

Recent microprocessors, such as the NexGen 586 (2K automata) and the Intel Pentium (256 automata), use this predictor.
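The 2-bit automaton in Figure 2b is a saturating counter whose most significant bit is the prediction. A C sketch under the same indexing assumption as before:

    #include <stdint.h>

    #define NPRED2 2048
    static uint8_t counter[NPRED2];       /* 2-bit states 0..3 (00..11) */

    /* States 10 and 11 predict taken; 00 and 01 predict not taken. */
    int predict_2bit(uint32_t branch_addr) {
        return counter[branch_addr % NPRED2] >= 2;
    }

    /* Saturating update: count up on taken, down on not taken, so a
       single contrary sign does not flip a strongly biased prediction. */
    void update_2bit(uint32_t branch_addr, int taken) {
        uint8_t *c = &counter[branch_addr % NPRED2];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }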
Two-level adaptive predictor. Researchers at the University of Michigan[8] and later IBM and the University of Texas[9] devised the two-level adaptive, or branch correlation, predictor, which is significantly more accurate (typically 93 percent accuracy) than the 1- or 2-bit predictors because the predictor bases its predictions on specific branch histories, not on a general averaging.

As Figure 3 shows, prediction involves two structures. The branch history register holds the branch execution history. Each time a dynamic instance of any branch resolves, its sign is shifted into the register. The register helps prediction by capturing much longer and more varied patterns of branch executions, relative to a 2-bit predictor. The branch pattern table contains a 2-bit counter automaton for each possible pattern of the branch history register. Typically, a processor uses one register and one table for all branches.

The automata are accessed using the contents of the branch history register as the table's address ("index" in the figure). As with the 1- and 2-bit predictors, the state of the indexed automaton indicates the prediction.

Using a single branch history register, the predictor combines information from multiple branches, allowing the correlation among different static branches to be exploited.
[Figure 2. Simple dynamic branch predictors, which predict whether a branch is taken by looking at the most significant bit of the predictor's state. This bit gives the sign of the branch: 1 is "predict T(aken)"; 0 is "predict N(ot taken)." A state transition occurs when a branch resolves, and is determined by that branch's sign. (a) 1-bit branch predictor: states 0 (predict N) and 1 (predict T). (b) 2-bit predictor: states 00 and 01 predict N, states 10 and 11 predict T; the outer states are saturated, the inner states unsaturated.]
[Figure 3. Two-level adaptive branch predictor. Each row of the branch pattern table is the equivalent of the 2-bit counter in Figure 2b. The branch history register holds the signs of past branch executions; the sign of the latest resolved branch is shifted in. The predictor uses this recent history to index to a particular automaton in the branch pattern table.]
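The structure in Figure 3 translates directly into C. The 4-bit history length is our assumption for the sketch; the single global register and table follow the text:

    #include <stdint.h>

    #define HIST_BITS 4                       /* assumed history length */
    #define PATTERNS  (1 << HIST_BITS)

    static uint8_t history;                   /* branch history register */
    static uint8_t pattern_table[PATTERNS];   /* one 2-bit counter per pattern */

    /* Predict from the counter selected by the recent global history. */
    int predict_2level(void) {
        return pattern_table[history] >= 2;   /* msb of the 2-bit counter */
    }

    /* On resolution: train the indexed counter, then shift the sign of
       the latest resolved branch into the history register. */
    void update_2level(int taken) {
        uint8_t *c = &pattern_table[history];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
        history = (uint8_t)(((history << 1) | (taken != 0)) & (PATTERNS - 1));
    }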

References

- "Multiscalar processors," conference proceedings.
- "A study of branch prediction strategies," conference proceedings.
- D.W. Wall, Limits of Instruction-Level Parallelism, book.
- "A VLIW architecture for a trace scheduling compiler," journal article.
- "Two-level adaptive training branch prediction," conference proceedings.