Proceedings ArticleDOI

Perceptron Based Consumer Prediction in Shared-Memory Multiprocessors

01 Oct 2006-pp 148-154
TL;DR: A perceptron consumer predictor is proposed that dynamically adapts its reaction to the system behavior and uses more history information than previous consumer predictors; it outperforms the previous predictors by 21% while using only 1KByte more storage than previous predictors.
Abstract: Recent research has shown that forwarding speculative data to other processors before it is requested can improve the performance of multiprocessor systems. The most recent work in speculative data forwarding places all of the processors on a single bus, allowing the data to be forwarded to all of the processors at the same cost as any subset of the processors. Modern multiprocessors however often employ more complex switching networks in which broadcast is expensive. Accurately predicting the consumers of data can be challenging, especially in the case of programs with many shared data structures. Past consumer predictors rely on simple prediction mechanisms, a single table lookup followed by a static mapping of the table values onto a prediction. We make two main contributions in this paper. First, we show how to reduce the design space of consumer predictors to a set of interesting predictors, and how previous consumer predictors can be tuned to expand the range of available performance. Second, we propose a perceptron consumer predictor that dynamically adapts its reaction to the system behavior, and uses more history information than previous consumer predictors. This predictor outperforms the previous predictors by 21% while using only 1KByte more storage than previous predictors.

Summary (3 min read)

Introduction

  • Past consumer predictors rely on simple prediction mechanisms, a single table lookup followed by a static mapping of the table values onto a prediction.
  • First, the authors show how to reduce the design space of consumer predictors to a set of interesting predictors, and how previous consumer predictors can be tuned to expand the range of available performance.
  • The authors of [1] propose several predictors that are able to potentially skip a level of indirection in requests for access to a line by predicting the current sharers and sending requests to them in parallel to a request to the directory.
  • Other recent work highlights the possibility of using speculation to simplify the design of multiprocessors [2] and to implement previously difficult-to-verify features in the coherence protocol [3].
  • This simplifies distribution of data, in that all transmissions are broadcasts on the bus, and thus reach all processors.

A. Consumer Predictors

  • Kaxiras and Young [5] provide a taxonomy of consumer predictors based on three parameters: (i) the indexing scheme of the history table, (ii) the depth of that table, and (iii) the function used to generate a prediction.
  • The history table contains entries, referenced by a combination of bits from the address of the block, and the PC of the instruction that wrote to that block.

  • The number of bitmaps stored at each index is called the depth of the table.
  • Readers interested in the details of each specific permutation of indices are referred to [5].
  • Kaxiras and Young [5] also describe three functions to use in consumer set predictors.
  • Union predicts that the next sharer bitmap will be the union of those present in the history table.
  • The Two-Level function keeps saturating counters for each potential consumer, indexed and updated using the history of that specific processor.

B. Quantifying the Behavior of Consumer Set Predictors

  • Predictions can be sorted into (i) false positives (FP), (ii) false negatives (FN), (iii) true positives (TP), and (iv) true negatives (TN) depending on the prediction made (P/N), and its correctness (T/F).
  • It is important to note that in the case of consumer prediction the two false cases have different results.
  • When that processor needs information, it will send a request to the directory, as it would have without any form of consumer set prediction.
  • Both false positives and false negatives are mispredictions, but only one of the two will cause a performance penalty.
  • Ideally, a predictor would maximize the number of true positives, and minimize the number of false positives.

A. Evaluating Consumer Predictors

  • Previous work has chosen to focus on predictors that perform best in terms of either PVP or sensitivity.
  • Depending on the details of a multiprocessor system the potential penalties for transmitting unneeded data and the potential benefits of correctly forwarding data will differ widely.
  • Maximizing PVP is a more challenging problem.
  • The opposite of the perfectly sensitive predictor would never predict positive.
  • These are the predictors which are optimal for some trade off of sensitivity and PVP.

B. Why Perceptrons Work For Consumer Prediction

  • All of the previous techniques of consumer set prediction have one common limitation.
  • When “1”s appear in both the actual and predictor entries, a true positive has occurred.
  • Thus intersection predicts that none of the processors will read the data.
  • On the bottom histogram this means that if processor P1 is a sharer now, some specific processor P2 will always be a sharer after the next invalidate.
  • In addition, perceptrons use information about the correlations between input and output to generate predictions, as shown in Figure 2.

D. Update Mechanism

  • There are two structures that need to be maintained for the perceptron to work, the history table, and the perceptron weight table.
  • The history table keeps depth bitmaps for each index.
  • A perceptron is updated when its output disagrees with the actual behavior of the system or if the magnitude of the sum was less than some threshold.
  • If data is distributed too early it may be requested by the original processor before another processor reads the data.
  • Second, data races could be introduced into the coherence protocol.

A. Methodology

  • The authors' study evaluates a large number of different predictors, searching the design space across depth, index, and function.
  • The sharing patterns the authors study would be unchanged by implementing coherence decoupling, and so feedback of the predictor on the logical program execution can be ignored.
  • The authors gathered traces from the SPLASH-2 [10] benchmark suite using GEMS [11].
  • All the dynamic predictors were evaluated based upon predictions made as the results became known.
  • While the history tables proposed are quite large, recent research shows that it is possible to reduce this size substantially with little effect on performance [19].

B. Prediction Accuracy

  • As you can see, the perceptron completely dominates the Two-Level predictor, as well as the more sensitive intersection predictors and higher PVP union predictors.
  • None of the other predictors can offer as high a PVP as the intersection predictor.
  • The behavior of the perceptron predictor can be adjusted using its threshold.
  • Notice that in all cases the best predictors are located at the processors, and not at the directory.
  • Figure 8 shows the co-optimal set of predictors when the total size of a predictor is reduced below that of the larger predictors the authors used earlier.

V. CONCLUSIONS

  • The authors show how consumer predictors can be compared to each other without selecting specific interconnect details by comparing the chance of a predictor sending incorrect messages with its chance of missing opportunities.
  • This method produces a range of different consumer predictor options that a system designer could pick from depending on a specific overall implementation.
  • The authors develop a perceptron consumer-set predictor that requires little more space than previous predictors.
  • The authors' perceptron predictor achieves a tradeoff between the two, with sensitivity between 0.5 and 0.65 and PVP between 0.6 and 0.8.
  • This range represents a previously unexplored tradeoff in consumer predictors.


Perceptron Based Consumer Prediction in
Shared-Memory Multiprocessors
Sean Leventhal and Manoj Franklin
School of Electrical and Computer Engineering
University of Maryland at College Park
{sleventh, manoj}@glue.umd.edu
Abstract: Recent research has shown that forwarding specula-
tive data to other processors before it is requested can improve the
performance of multiprocessor systems. The most recent work
in speculative data forwarding places all of the processors on
a single bus, allowing the data to be forwarded to all of the
processors at the same cost as any subset of the processors.
Modern multiprocessors however often employ more complex
switching networks in which broadcast is expensive. Accurately
predicting the consumers of data can be challenging, especially
in the case of programs with many shared data structures.
Past consumer predictors rely on simple prediction mecha-
nisms, a single table lookup followed by a static mapping of the
table values onto a prediction. We make two main contributions
in this paper. First, we show how to reduce the design space
of consumer predictors to a set of interesting predictors, and
how previous consumer predictors can be tuned to expand the
range of available performance. Second, we propose a perceptron
consumer predictor that dynamically adapts its reaction to the
system behavior, and uses more history information than previous
consumer predictors. This predictor outperforms the previous
predictors by 21% while using only 1KByte more storage than
previous predictors.
I. INTRODUCTION
The increase in transistor count and decrease in hardware
cost over the last several years have caused multiproces-
sor systems to become more common. Entire multiprocessor
systems are now available to consumers on a single chip.
Traditionally, shared-memory multiprocessors specify the way
in which they communicate either over a bus, or a more
complicated network through a coherence protocol. This
coherence protocol is responsible for assuring that the memory
system behaves in a way that guarantees correct execution by
managing all communication between processors. A variety
of techniques use speculation, modifying these coherence
protocols in order to improve performance. For instance, the
authors of [1] propose several predictors that are able to
potentially skip a level of indirection in requests for access to
a line by predicting the current sharers and sending requests
to them in parallel to a request to the directory.
Other recent work highlights the possibility of using spec-
ulation to simplify the design of multiprocessors [2] and to
implement previously difficult-to-verify features in the coher-
ence protocol [3]. Coherence decoupling [4] allows an out-
of-order core to execute speculatively based upon potentially
incoherent data. The authors of [4] provide two separate
coherence decoupling schemes, one which seeks to eliminate
false sharing, and another which seeks to distribute data to
its consumers preemptively. The preemptive data distribution
assumes a single bus based architecture. This simplifies dis-
tribution of data, in that all transmissions are broadcasts on
the bus, and thus reach all processors. In order to extend
this system to an arbitrary network some form of consumer
prediction [5] would be needed (broadcasting to everyone in such a system would itself be a naive form of consumer prediction).
Methods similar to this are used in software [6] to identify
likely consumers using profiling and other compiler tech-
niques, and insert special instructions to forward data to them
at appropriate times. Consumer set prediction [5], [7] attempts
to identify the processors which will consume data. This
allows forwarding of data to its destination before a request
for the data is sent.
We propose that consumer set prediction should be com-
bined with coherence decoupling in order to implement an
update mechanism on an arbitrary topology efficiently. In this
paper we show how to use a perceptron to design a consumer
predictor that represents a unique tradeoff between bandwidth
usage and coverage. We show how to tune the behavior of a
perceptron predictor to achieve a wider range of tradeoffs be-
tween extra transmissions, and correct transmissions. Finally,
we show that a perceptron predictor is able to outperform
previous predictors by 21% when the goal is to achieve an
approximately even tradeoff between these two factors.
II. BACKGROUND
A. Consumer Predictors
Kaxiras and Young [5] provide a taxonomy of consumer
predictors based on three parameters: (i) the indexing scheme
of the history table, (ii) the depth of that table, and (iii)
the function used to generate a prediction. Using these, a predictor is represented in the form function(index)_depth, where the subscript gives the table depth.
The history table contains entries, referenced by a combination
of bits from the address of the block, and the PC of the
instruction that wrote to that block. This table can be located
at each processor, at the directories, or in a global location.
In the given taxonomy this is represented by including bits
corresponding to the directory or processor in the indexing
scheme.
Entries in the history table are comprised of bitmaps, each
of which corresponds to a group of sharers between two sets of
invalidates. A bitmap contains a single bit for each processor in

TABLE I: The three terms we use in this paper to quantify the behavior of consumer predictors.

Prevalence = (TP + FN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Predictive Value of a Positive Test (PVP) = TP / (TP + FP)
the system, set to one to indicate that a processor is a sharer
and zero to indicate that it is not a sharer. The number of
bitmaps stored at each index is called the depth of the table.
When an invalidate occurs, a new bitmap of sharers is created,
and one of the old bitmaps is deleted. Thus a Consumer Set
predictor function(pid+pc_4)_2 indicates a predictor located at
each processor, indexed using 4 bits of the program counter,
with a depth of 2. Readers interested in the details of each
specific permutation of indices are referred to [5].
Kaxiras and Young [5] also describe three functions to use
in consumer set predictors.
(i) Union: Predict that the next sharer bitmap will be the
union of those present in the history table.
(ii) Intersection: Predict that the next sharer bitmap will
be the intersection of those present in the history table.
(iii) Two-Level PAs Prediction: Keep a set of two bit
up/down saturating counters for each potential consumer.
These are indexed and updated using the history of that
specific processor. Thus, for N processors, N × 2^depth counters are needed. (A code sketch of all three functions follows.)
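To make the taxonomy concrete, the sketch below implements the three functions over a history of depth sharer bitmaps. It is our own illustration, not code from [5]; the bitmap-as-integer encoding and all names are assumptions.

```python
def predict_union(history):
    """Union: predict every processor seen anywhere in the history."""
    pred = 0
    for bitmap in history:              # history holds `depth` sharer bitmaps
        pred |= bitmap
    return pred

def predict_intersection(history):
    """Intersection: predict only processors present in every bitmap."""
    pred = ~0
    for bitmap in history:
        pred &= bitmap
    return pred

def predict_two_level(history, counters, num_procs):
    """Two-Level PAs: a 2-bit up/down saturating counter per processor,
    indexed by that processor's own depth-bit history pattern, so
    N * 2**depth counters in total."""
    pred = 0
    for p in range(num_procs):
        pattern = 0
        for bitmap in history:          # build this processor's history pattern
            pattern = (pattern << 1) | ((bitmap >> p) & 1)
        if counters[p][pattern] >= 2:   # counters hold 0..3; 2 or 3 predicts "sharer"
            pred |= 1 << p
    return pred
```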
B. Quantifying the Behavior of Consumer Set Predictors
We use the following terminology (same as that proposed
in [5]) to describe the behavior of an individual predictor.
Predictions can be sorted into (i) false positives (FP), (ii)
false negatives (FN), (iii) true positives (TP), and (iv) true
negatives (TN) depending on the prediction made (P/N), and
its correctness (T/F). It is important to note that in the case of
consumer prediction the two false cases have different results.
A false positive incurs a penalty over normal execution. It
uses up bandwidth in transmitting extra data to no effect. A
false negative results in normal execution. When that processor
needs information, it will send a request to the directory, as
it would have without any form of consumer set prediction.
Both false positives and false negatives are mispredictions, but
only one of the two will cause a performance penalty.
Similarly, true negatives do not result in any benefit, while
true positives can yield an improvement in performance. Both
of these are correct predictions, but only one of the two is of
any value. Ideally, a predictor would maximize the number of
true positives, and minimize the number of false positives. To
quantify this behavior three terms are defined in Table I.
The prevalence, or frequency of positive cases, is a property
of the values being predicted and not the predictor itself. Thus,
we can reduce comparisons of consumer set predictors to two
terms: sensitivity (the number of potential positives that were
correctly predicted), and PVP (the reliability of a positive
prediction). Notice that when the number of true positives is
maximized the sensitivity will be one, and when the number
of false positives is zero the PVP will be one.
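For reference, the three terms reduce to simple arithmetic over the four outcome counts. A minimal sketch of our own, with the undefined-PVP case (discussed in Section III-A) made explicit:

```python
def prevalence(tp, tn, fp, fn):
    return (tp + fn) / (tp + tn + fp + fn)   # frequency of positive cases

def sensitivity(tp, fn):
    return tp / (tp + fn)                    # fraction of actual consumers predicted

def pvp(tp, fp):
    if tp + fp == 0:
        return float("nan")                  # PVP is undefined with no positive predictions
    return tp / (tp + fp)                    # reliability of a positive prediction
```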
Fig. 1. Predictions made by previous functions with a depth of two on a
simple pattern. The predictions made by a two-level predictor would depend
on its depth. Depending on initialization conditions a two-level predictor with
a depth of two would make different predictions for the above example, but
all such cases will contain mispredictions.
III. PERCEPTRON CONSUMER PREDICTORS
A. Evaluating Consumer Predictors
Previous work has chosen to focus on predictors that
perform best in terms of either PVP or sensitivity. However,
depending on the details of a multiprocessor system the poten-
tial penalties for transmitting unneeded data and the potential
benefits of correctly forwarding data will differ widely. Each
system will represent a potentially unique trade off between
sensitivity and PVP. In fact, if the goal is to maximize
sensitivity the predictor design is trivial: simply predict that
every processor will be a consumer and the sensitivity will
be one. Maximizing PVP is a more challenging problem.
The opposite of the perfectly sensitive predictor would never
predict positive. However, when TP and FP are both zero PVP
is undefined. It is possible to bring the PVP of any predictor
closer to one using some form of confidence estimation.
Rather than assuming that these penalties and benefits are
in an extreme case, we leave decisions about this trade off to
those designing specific systems, and instead investigate the
set of co-optimal predictors. These are the predictors which
are optimal for some trade off of sensitivity and PVP.
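A minimal sketch of how such a co-optimal (Pareto-optimal) set could be extracted from measured (sensitivity, PVP) pairs; the representation and names are our own, not the authors' tooling:

```python
def co_optimal(points):
    """points: list of (sensitivity, pvp) pairs, one per predictor.
    A point is kept unless another point is at least as good on both
    axes and strictly better on at least one."""
    def dominates(b, a):
        return b[0] >= a[0] and b[1] >= a[1] and (b[0] > a[0] or b[1] > a[1])
    return [a for a in points if not any(dominates(b, a) for b in points)]
```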
B. Why Perceptrons Work For Consumer Prediction
All of the previous techniques of consumer set prediction
have one common limitation. In determining whether some
processor will be a sharer, they look only at the history of
that processor.² Making a prediction is simple, but potentially
useful information is thrown away.
Figure 1 shows an example of a simple sharing pattern for a
particular memory block. The vertical axis represents the dif-
ferent processors, and the horizontal axis represents different
epochs over time. The sharing pattern is shown in the portion
of each cell labeled Actual. A ”1” in a cell indicates that
the corresponding processor is a sharer of that memory block
during that epoch. In this pattern two processors have access
to a piece of data at any time. Each column corresponds to
the set of sharers for some interval of time, with invalidations
occurring between them. Each row corresponds to a single
² Some coherence predictors, such as the ones proposed in [15], [18], do look at this information, but to our knowledge no such predictor has been directed specifically at consumer prediction.

processor. Thus processors A and B have read permission on
the data, one writes and it is passed to processors C and D. E
and F receive the data next, followed by G and H. The pattern
then repeats.
The cells labeled Union and Intersection show the predic-
tions made by each of the two functions with a history of
depth two. The first two columns are not marked, as those
predictions will depend on the initial conditions. When “1”s
appear in both the actual and predictor entries, a true positive
has occurred. When a “1” appears for a predictor and a “0”
appears for the actual result, a false positive has occurred, and
so on through all four cases. For instance, in the case of the
third set of sharers, the union predictor sees that in the last two
sets of sharers, processors A through D had possession of the
data at some time. Union predicts that processors A through D
will want the data this time. Intersection on the other hand sees
that no processor had read permission to the data two times in
a row. Thus intersection predicts that none of the processors
will read the data.
Notice that the pattern is extremely simple and repetitive,
but the previously proposed predictors cannot identify it. In
fact, neither union nor intersection has a single true positive. It
is clear that taking additional information into account could
yield a better prediction. A reasonable question is whether
such behaviors occur in practice. Does the presence of a
processor in the set of sharers ever correspond to the presence
of a different processor in a previous set of sharers? We now
address this question by analyzing the amount of correlation
present across processor boundaries.
Figure 2 shows the amount of correlation that a perceptron
could exploit. In each group of sharers each processor has a
state, either present, or not present. The top histogram shows
the percentage of lines for which the state of a processor in
one group of sharers is correlated to the state of the same
processor in the next group of sharers. The bottom histogram
shows the percentage of lines for which the state of a processor
in one group of sharer is correlated to the state of a different
processor in the next group of sharers. A correlation of one
indicates that either the presence or absence of a processor
can be linked directly to the presence or absence of another
processor in the next set of sharers. On the top histogram
this means that if processor P1 is a sharer now, it will be a sharer after the next invalidate, and if it is not a sharer now it will not be a sharer after the next invalidate. On the bottom histogram this means that if processor P1 is a sharer now, some specific processor P2 will always be a sharer after the next invalidate. A correlation of negative one indicates that the presence or absence of a processor can be linked to the opposite behavior in the next set of sharers. Other correlations indicate a relation between one event and the next that is not absolute. A correlation of 0.9 would indicate that the vast majority of the time (95%) the value at the next time is the same as the value at this time. A correlation of -0.5 would indicate that the next value was different 75% of the time; in the top histogram this would mean that if processor P1 is a sharer now, it will likely not be a sharer after it is invalidated, and if processor P1 is not a sharer now, it will likely be a sharer after the next invalidate.
Fig. 2. Histogram of the correlations of a sharer’s presence in one iteration based upon both its own presence (top panel, “Correlation With Same Processor”) and other sharers’ presence (bottom panel, “Correlation With Other Processors”) in the previous iteration; each panel plots the percent of cases against correlations from -1 to 1. This data was collected from the SPLASH-2 benchmark FMM on a 16-processor simulation, but is representative of the behavior seen in other benchmarks.
As we can see, a processor’s own history is the single
biggest indicator in whether it will be a sharer in the next
group. However, in many cases there is a strong relationship
with other processors as well. In fact, in almost 30% of cases
the presence of a processor in a group of sharers can be
linked to another processor in the previous group of sharers.
Thus there is reason to believe that the history of other
processors could improve the performance of a consumer
predictor. Also, there is a measurable amount of negative
correlation. This negative correlation cannot be addressed by
previous predictors, except to a small extent the Two-Level
predictor for which no results were published [5].
We propose taking advantage of these correlations using
a perceptron. The computer science community has done a
great deal of work developing neural networks constructed
of perceptrons, each of which is trained to identify correla-
tions between its inputs, and the desired output. By tracking
correlations between the desired prediction and the inputs,
perceptrons dynamically isolate the relevant portions of the
input from irrelevant portions of the input. In addition, per-
ceptrons use information about the correlations between input
and output to generate predictions, as shown in Figure 2. Thus
a perceptron can take advantage of both negative and positive
correlations.
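The paper does not spell out its measurement procedure, but correlations like those in Figure 2 can be computed from a per-line trace of sharer sets using standard Pearson correlation. A sketch under that assumption (the array layout and numpy usage are ours):

```python
import numpy as np

def sharer_correlations(epochs):
    """epochs[t][p] = 1 if processor p was a sharer in epoch t, else 0.
    Returns corr[i][j]: correlation of processor i's presence in one epoch
    with processor j's presence in the previous epoch. The diagonal feeds
    the top histogram of Figure 2; off-diagonal entries feed the bottom."""
    e = np.asarray(epochs, dtype=float)
    prev, nxt = e[:-1], e[1:]                 # consecutive sharer sets
    n = e.shape[1]
    corr = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if nxt[:, i].std() > 0 and prev[:, j].std() > 0:   # skip constant series
                corr[i, j] = np.corrcoef(nxt[:, i], prev[:, j])[0, 1]
    return corr
```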
C. Perceptron Consumer Predictor
The perceptrons we use are identical in structure to those
proposed in [8] for branch prediction, and are located with the
history table. Each history table has a separate perceptron for
each potential consumer, with an overall topology shown in
Figure 3. Predictions are made as follows:
(i) The history table is indexed using some combination
of bits from the program counter, address, and directory.
(ii) The entry at that location is used as input to as many
perceptrons as the number of processors in the system.

Fig. 3. Predictor Architecture. Each history table has as many perceptrons
as the number of processors in the system. These perceptrons are used for
every entry in the history table. Each entry in the history table has a number
of bitmaps equal to the history depth, each of which contains a bit for each
processor.
Fig. 4. Structure of a Perceptron Predictor
(iii) The outputs at each perceptron correspond to the
predictions made for each processor.
We study a number of different history table configurations;
a single global history table, a history table at each directory,
and a history table at each processor. In all cases we find that
it is best to place a history table at each processor.
To make a prediction the history for a given index is used as
input to the perceptron, which has a corresponding weight for
each bit. The weights are either added to, or subtracted from
a sum, depending on the corresponding bit in the input. If
the sum is greater than zero the perceptron predicts positive,
otherwise the perceptron predicts negative. Figure 4 shows
how this works. The perceptron treats the presence of a
processor in a group of sharers as a 1, and its absence as a -1.
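A sketch of this prediction step for a single perceptron (one candidate consumer). The weight layout and names are our own; the structure follows the branch-prediction perceptron of [8]:

```python
def perceptron_predict(weights, history, num_procs):
    """weights: num_procs * depth signed integers, one per history bit.
    history: the depth sharer bitmaps from the indexed history-table entry."""
    total, k = 0, 0
    for bitmap in history:
        for p in range(num_procs):
            x = 1 if (bitmap >> p) & 1 else -1   # presence -> +1, absence -> -1
            total += weights[k] * x              # add or subtract the weight
            k += 1
    return total > 0, total                      # predict "consumer" if the sum is positive
```

One such perceptron is evaluated per potential consumer, so a 16-processor system evaluates 16 of these against the same history entry.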
D. Update Mechanism
There are two structures that need to be maintained for
the perceptron to work, the history table, and the perceptron
weight table. We address each of these tasks here.
The history table keeps depth bitmaps for each index. Each
bitmap contains the last set of sharers corresponding to this
index, which may include both PC and address information.
If the history tables are located at the directory it is relatively
easy to track all sharers, as the directory is responsible for
tracking that information in order to maintain coherence. If
the history tables are located at each processor it is slightly
more complicated. In this case information about sharers can
be piggybacked onto an existing response message from the
directory whenever a processor requests exclusive access.
Maintaining the perceptron weights will require an extra
message to be sent in a few rare cases. The perceptron weights
are updated only when another processor requests exclusive
access. At this time we know the set of all consumers of
the last write, and would like to pass that information to the
producer. If the consumer still has read permission, which
is likely given that this is the first write to occur since the
producer had exclusive access, we can attach this information
to the invalidate request sent to the producer. If the producer
has released access permission, the directory sends a message
to them with the consumer bitmap. We choose to update the
perceptron based on the prediction it would make when it
has received all the information needed to update. Doing this
prevents hysteresis effects, and reduces storage requirements.
It would also be possible to store the information needed to
update the perceptron when the original prediction was made,
but this would fail to account for more recent changes to the
perceptron, and would require additional information.
Once data arrives each perceptron is evaluated to see if it
needs to be updated based upon the prediction it would have
made. A perceptron is updated when its output disagrees with
the actual behavior of the system or if the magnitude of the
sum was less than some threshold. Each weight in the percep-
tron is incremented if the corresponding input agreed with the
output, and decremented if the input disagreed with the output.
Thus the threshold decides when the processor stops training.
A low threshold means that the resulting weights are able to
adapt more quickly if the behavior of the program changes. A
high threshold means that the perceptron itself will be slower
to change, and thus be less influenced by brief changes in
program behavior. In the taxonomy proposed in [5] we denote
this predictor as Perceptron_threshold(index)_depth.
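The training rule just described can be sketched as follows; this is our own rendering, with the saturation bound reflecting the finite weight width discussed in Section III-E:

```python
def perceptron_update(weights, inputs, total, was_consumer, threshold, w_max):
    """inputs: the +1/-1 history bits used for the prediction; total: the
    perceptron sum it produced; was_consumer: the actual outcome."""
    predicted = total > 0
    if predicted != was_consumer or abs(total) <= threshold:   # train on error or low confidence
        t = 1 if was_consumer else -1
        for k, x in enumerate(inputs):
            w = weights[k] + t * x                    # increment on agreement, decrement otherwise
            weights[k] = max(-w_max, min(w_max, w))   # saturate at the weight's bit width
```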
E. Implementation Issues
There are three main concerns with implementing a per-
ceptron based consumer predictor: the resources needed to do
so, the prediction latency, and the modifications needed to the
coherence protocol. One important thing to note with regard
to size is that the number of perceptrons used is relatively
low, N × H, where N is the number of processors and H is
the number of history tables. At any one history table there
will never be more than N perceptrons, each of which has
a total number of weights, N × depth. For a 16 processor
system with a depth of 4 (the largest we test) this corresponds
to 1024 different weights at each processor or directory. The
number of bits needed for each weight is 1 + log2(thresh) [8]. We explored a range of thresholds, for which each weight consumes between five and ten bits, for a total cost of 0.63 to 1.25 KB. A few adders
are needed to compute the predictions.
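As a worked check of these numbers (our own arithmetic, using the weight widths quoted above):

```python
N, depth = 16, 4                   # 16 processors, depth-4 history
weights = N * (N * depth)          # N perceptrons per table, N*depth weights each = 1024
for bits in (5, 10):               # roughly 1 + log2(thresh) bits per weight
    print(f"{bits}-bit weights: {weights * bits / 8 / 1024:.3f} KB")
# -> 0.625 KB and 1.250 KB, the 0.63 to 1.25 KB range quoted above
```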
In addition to a potentially large size, perceptrons are
also slower than many predictors. In our case it takes the latency of six additions to fully calculate a prediction, whereas

Union and Intersection predictors only need to evaluate a
single bitwise logical operation. However, memory and request
latencies in multiprocessor systems are typically quite large.
Recent work in CMPs shows that the latencies of memory
requests on realistic CMPs are at least 120 cycles [9]. Given
that SMP systems will have longer communication times we
expect that the additional latency of 6 additions will have little
effect.
Implementing consumer set prediction as part of a tradi-
tional coherence protocol is potentially challenging for two
reasons. First, it is necessary to identify times at which to
distribute data. Our study focuses on identifying the consumers
of data so that the data can be forwarded to them before it is
requested. We do not describe when to do so. If data is dis-
tributed too early it may be requested by the original processor
before another processor reads the data. This will result in a
delay, as the original processor acquires write permission again,
and extra communication on the bus. These penalties are paid
even if the consumer prediction was correct. Second, data races
could be introduced into the coherence protocol. Eliminating
all of these race conditions is a challenging problem, which
has led to most coherence protocols in practice being simple,
or unverified.
Coherence decoupling, proposed by Huh et al. [4], greatly
simplifies these design issues. Coherence decoupling allows
speculative execution to occur in an out-of-order core based on
incoherent data. One proposed form of coherence decoupling
includes a speculative update. This speculative update writes
data to the bus before the coherence protocol would. Other
processors that possess an invalid copy of the line in their
cache update this invalid line with the new results. This allows
them to speculatively execute using data that they could not
have possessed yet if they had obeyed the coherence protocol.
Because this was implemented on a snoopy bus it is effectively
the same as predicting that all of the other processors are
consumers. On a bus this makes perfect sense, as transmitting
to additional consumers uses no extra resources. However,
broadcast can be expensive in other, more complex topologies.
Some form of consumer set prediction would be a natural ex-
tension to such an update mechanism in arbitrary interconnect
topologies.
IV. EXPERIMENTAL RESULTS
A. Methodology
Our study evaluates a large number of different predictors,
searching the design space across depth, index, and function.
To facilitate this we use trace-based simulation. The sharing
patterns we study would be unchanged by implementing
coherence decoupling, and so feedback of the predictor on
the logical program execution can be ignored. We assume that
the L2 cache of each processor is infinite and use 128-byte
lines.
We gathered traces from the SPLASH-2 [10] benchmark
suite using GEMS [11]. We used only the Ruby module
of GEMS, simulating a 16 processor system with in-order
execution, a 64KB L1 cache, and a 16MB L2 cache. The
default input set was used for each benchmark.
Fig. 5. The set of co-optimal predictors found for each prediction function, plotted as PVP (0.3 to 1.0) against sensitivity (0.3 to 0.8) for the Intersection, Union, Up/Down Counter, and Perceptron predictors. The perceptron has many more points because threshold was varied. With threshold held constant only a few predictors are co-optimal. Note the offsets on both axes.
We explore the space of predictors which use as many as
1M entries per history table, depths as high as 4, and the
prediction functions Union, Intersection, Two-Level PAs, and
Perceptron. We varied the threshold of the perceptron from 10
to 500. All the dynamic predictors were evaluated based upon
predictions made as the results became known.
While the history tables proposed are quite large, recent
research shows that it is possible to reduce this size substan-
tially with little effect on performance [19]. We also show
that the overall trends in these large tables are also present in
much smaller tables. Rather than display the top performers
according to either sensitivity or PVP we look at predictors
that are co-optimal in terms of sensitivity and PVP.
B. Prediction Accuracy
Figure 5 shows the set of co-optimal predictors generated
by each function in a 16 processor system. As you can see, the
perceptron completely dominates the Two-Level predictor, as
well as the more sensitive intersection predictors and higher
PVP union predictors. In order to choose a predictor for a full
system design it is necessary to know about the relative worth
of PVP and sensitivity. This will vary depending on a wide
range of design decisions, including the interconnect topology,
processor speed, processing core design, and coherence pro-
tocol. In general PVP is more important relative to sensitivity
when less bandwidth is available.
If PVP and sensitivity are of equal value a predictor’s
performance can be measured as the distance from itself to the
perfect predictor (PVP = 1, Sensitivity = 1). In the case
of this particular metric, perceptron prediction is on average
21% better than the next best predictor. Perceptron is at best
a distance of 0.483 from a perfect predictor. Intersection is
the next best at a distance of 0.608, followed by Union at
a distance of 0.609. Regardless of the metric chosen, the
perceptron predictor provides performance in a region of the
consumer-set design space that was previously unavailable.
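Under our reading, this metric is the Euclidean distance from a predictor's (sensitivity, PVP) point to the perfect predictor at (1, 1); a small sketch with the reported best distances:

```python
from math import hypot

def distance_to_perfect(sensitivity, pvp):
    return hypot(1.0 - sensitivity, 1.0 - pvp)   # distance to (Sensitivity, PVP) = (1, 1)

# Reported best distances: perceptron 0.483, intersection 0.608, union 0.609;
# (0.608 - 0.483) / 0.608 is roughly 21%, the improvement quoted above.
```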
Table II displays each of the predictors in the co-optimal
set. We found that the perceptron predictor did best with
a single indexing scheme, and can be tuned across a wide

Citations
Journal ArticleDOI
TL;DR: Three cache coherence mechanisms optimized for CMPs are presented, including a dynamic write-update mechanism augmented on top of a write-invalidate protocol, a bandwidth-adaptive mechanism to eliminate performance degradation from write-updates under limited bandwidth, and a proximity-aware mechanism to extend the base adaptive protocol with latency-based optimizations.
Abstract: In chip multiprocessors (CMPs), maintaining cache coherence can account for a major performance overhead. Write-invalidate protocols adapted by most CMPs generate high cache-to-cache misses under producer–consumer sharing patterns. Accordingly, this paper presents three cache coherence mechanisms optimized for CMPs. First, to reduce coherence misses observed in write-invalidate-based protocols, we propose a dynamic write-update mechanism augmented on top of a write-invalidate protocol. This mechanism is specifically triggered at the detection of a producer–consumer sharing pattern. Second, we extend this adaptive protocol with a bandwidth-adaptive mechanism to eliminate performance degradation from write-updates under limited bandwidth. Finally, proximity-aware mechanism is proposed to extend the base adaptive protocol with latency-based optimizations. Experimental analysis is conducted on a set of scientific applications from the SPLASH-2 and NAS parallel benchmark suites. The proposed mechanisms were shown to reduce coherence misses by up to 48% and in return speed up application performance up to 30%. Bandwidth-adaptive mechanism is proven to perform well under varying levels of available bandwidth. Results from our proposed proximity-aware extension demonstrated up to 6% performance gain over the base adaptive protocol for 64-core tiled CMP runs. In addition, the analytical model provided good estimates for performance gains from our adaptive protocols.

16 citations

Journal ArticleDOI
TL;DR: A new scheme has been proposed to reduce shared cache miss rate in multi-processor system-on-chips that benefits from novel prefetching techniques to L1 caches from off-chip memories or other remote L2 caches located on-chip.
Abstract: Cache miss can have a major impact on overall performance of many-core systems A miss may result in extra traffic and delay because of coherency messages This has been reduced in coarse-grain coherency protocols where only shared misses require a coherency message Conventional off-chip methods manage the shared miss rate by relying on reuse histories However the pertinent memory overhead that comes with reuse histories makes them impractical for on-chip multi-processor systems In this study, a new scheme has been proposed to reduce shared cache miss rate in multi-processor system-on-chips that benefits from novel prefetching techniques to L2 caches from off-chip memories or other remote L2 caches located on-chip In the proposed scheme, the previously proposed Virtual Tree Coherence (VTC) method has been extended to limit block forwarding messages to true sharers within each region Instead of relying on exact reuse histories, shared regions are searched for regional, temporal and statistical similarities These similarities are exploited for determining the sharers that should receive the forwarded blocks The proposed method has been evaluated with Splash-2 workloads Simulation results indicate that the proposed method has reduced shared miss count by up to 75%, and improved interconnect traffic by up to 47% compared with VTC

14 citations

Proceedings ArticleDOI
01 Dec 2012
TL;DR: This work proposes a new run-time coherence target prediction scheme that exploits the inherent correlation between synchronization points in a program and coherence communication and builds a predictor that can improve the miss latency of a directory protocol by 13%.
Abstract: Predicting target processors that a coherence request must be delivered to can improve the miss handling latency in shared memory systems. In directory coherence protocols, directly communicating with the predicted processors avoids costly indirection to the directory. In snooping protocols, prediction relaxes the high bandwidth requirements by replacing broadcast with multicast. In this work, we propose a new run-time coherence target prediction scheme that exploits the inherent correlation between synchronization points in a program and coherence communication. Our workload-driven analysis shows that by exposing synchronization points to hardware and tracking them at run time, we can simply and effectively track stable and repetitive communication patterns. Based on this observation, we build a predictor that can improve the miss latency of a directory protocol by 13%. Compared with existing address- and instruction-based prediction techniques, our predictor achieves comparable performance using substantially smaller power and storage overheads.

11 citations

Posted Content
TL;DR: A learning-aided approach to predict future data accesses is proposed and it is found that a powerful LSTM-based recurrent neural network model can provide high prediction accuracy based on only a cache trace as input.
Abstract: Caching techniques are widely used in the era of cloud computing from applications, such as Web caches to infrastructures, Memcached and memory caches in computer architectures. Prediction of cached data can greatly help improve cache management and performance. The recent advancement of deep learning techniques enables the design of novel intelligent cache replacement policies. In this work, we propose a learning-aided approach to predict future data accesses. We find that a powerful LSTM-based recurrent neural network model can provide high prediction accuracy based on only a cache trace as input. The high accuracy results from a carefully crafted locality-driven feature design. Inspired by the high prediction accuracy, we propose a pseudo OPT policy and evaluate it upon 13 real-world storage workloads from Microsoft Research. Results demonstrate that the new cache policy improves state-of-art practical policies by up to 19.2% and incurs only 2.3% higher miss ratio than OPT on average.

9 citations


Cites methods from "Perceptron Based Consumer Predictio..."

  • ...Because of its simplicity, the perceptron technique is used later for several other systems problems [22], [42]....


Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper evaluates an adaptive protocol which targets write-update optimizations for producer-consumer sharing patterns and targets a minimalistic hardware extension approach to test the benefits of such adaptive protocols in a practical environment.
Abstract: Multi-core architectures also referred to as Chip Multiprocessors (CMPs) have emerged as the dominant architecture for both desktop and high-performance systems. CMPs introduce many challenges that need to be addressed to achieve the best performance. One of the big challenges comes with the shared-memory model observed in such architectures which is the cache coherence overhead problem. Contemporary architectures employ write-invalidate based protocols which are known to generate coherence misses that yield to latency issues. On the other hand, write-update based protocols can solve the coherence misses problem but they tend to generate excessive network traffic which is especially not desirable for CMPs. Previous studies have shown that a single protocol approach is not sufficient for many sharing patterns. As a solution, this paper evaluates an adaptive protocol which targets write-update optimizations for producer-consumer sharing patterns. This work targets a minimalistic hardware extension approach to test the benefits of such adaptive protocols in a practical environment. Experimental study is conducted on a 16-core CMP by using a full-system simulator with selected scientific applications from SPLASH-2 and NAS parallel benchmark suites. Results show up to 40% improvement for coherence misses which corresponds to 15% application speedup.

8 citations

References
Proceedings ArticleDOI
01 May 1995
TL;DR: This paper quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Abstract: The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which re redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.

4,002 citations


"Perceptron Based Consumer Predictio..." refers background in this paper

  • ...(ii) The entry at that location is used as input to as many perceptrons as the number of processors in the system....


Journal ArticleDOI
TL;DR: The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers as mentioned in this paper, which includes a set of timing simulator modules for modeling the timing of the memory system and microprocessors.
Abstract: The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under GNU GPL [9].

1,515 citations

01 Jan 2005
TL;DR: The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers and has released a set of timing simulator modules for modeling the timing of the memory system and microprocessors.

1,464 citations


"Perceptron Based Consumer Predictio..." refers background in this paper

  • ...(ii) The entry at that location is used as input to as many perceptrons as the number of processors in the system....


Journal ArticleDOI
01 May 2005
TL;DR: Examination of the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor shows that designs that treat interconnect as an entity that can be independently architected and optimized would not arrive at the best multi-core design.
Abstract: This paper examines the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor, attempting to present a comprehensive view of a class of interconnect architectures. It shows that the design choices for the interconnect have significant effect on the rest of the chip, potentially consuming a significant fraction of the real estate and power budget. This research shows that designs that treat interconnect as an entity that can be independently architected and optimized would not arrive at the best multi-core design. Several examples are presented showing the need for careful co-design. For instance, increasing interconnect bandwidth requires area that then constrains the number of cores or cache sizes, and does not necessarily increase performance. Also, shared level-2 caches become significantly less attractive when the overhead of the resulting crossbar is accounted for. A hierarchical bus structure is examined which negates some of the performance costs of the assumed base-line architecture.

402 citations

Proceedings ArticleDOI
20 Jan 2001
TL;DR: A new method for branch prediction is presented that uses one of the simplest possible neural networks, the perceptron, as an alternative to the commonly used two-bit counters and achieves increased accuracy by making use of long branch histories.
Abstract: This paper presents a new method for branch prediction. The key idea is to use one of the simplest possible neural networks, the perceptron, as an alternative to the commonly used two-bit counters. Our predictor achieves increased accuracy by making use of long branch histories, which are possible becasue the hardware resources for our method scale linearly with the history length. By contrast, other purely dynamic schemes require exponential resources. We describe our design and evaluate it with respect to two well known predictors. We show that for a 4K byte hardware budget our method improves misprediction rates for the SPEC 2000 benchmarks by 10.1% over the gshare predictor. Our experiments also provide a better understanding of the situations in which traditional predictors do and do not perform well. Finally, we describe techniques that allow our complex predictor to operate in one cycle.

351 citations


"Perceptron Based Consumer Predictio..." refers background in this paper

  • ...In this pattern two processors have access to a piece of data at any time....


Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "Perceptron based consumer prediction in shared-memory multiprocessors" ?

The authors make two main contributions in this paper. First, the authors show how to reduce the design space of consumer predictors to a set of interesting predictors, and how previous consumer predictors can be tuned to expand the range of available performance. Second, the authors propose a perceptron consumer predictor that dynamically adapts its reaction to the system behavior, and uses more history information than previous consumer predictors.