A Simple Divide-and-Conquer Approach for Neural-Class Branch Prediction
Gabriel H. Loh
College of Computing
Georgia Tech
2005 Sep 20
aren't we done with branch predictors yet?
Branch predictors are still important:
- performance for large instruction windows
  e.g., CPR [Akkary et al./MICRO'03], CFP [Srinivasan et al./ASPLOS'04]
- power: better branch prediction reduces wrong-path instructions
- throughput: wrong-path instructions steal resources from other threads in SMT/SOEMT
recent bpred research
- "neural-inspired" predictors: perceptron, piecewise-linear, O-GEHL, ...
- very high accuracy
- relatively high complexity: a barrier to industrial adoption
outline
- quick synopsis of neural techniques
- gDAC predictor
  - idea
  - specifics
  - ahead-pipelining
  - results
- why gDAC works
gshare
Records previous outcomes given a branch identifier (PC) and a context (BHR).
Different contexts may lead to different predictions for the same branch.
Assumes correlation between the context and the outcome.

[Figure: the PC of branch "foobar" is hashed with the branch history
register (BHR) to index the Pattern History Table (PHT) of two-bit
counters, producing a taken/not-taken prediction]
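To make the mechanism concrete, a minimal Python sketch of a gshare lookup
and update follows; the XOR hash and 2-bit saturating counters are the
standard choices, but the exact sizing is an illustrative assumption:

    # 2-bit saturating counters indexed by PC XOR history; sizes are
    # illustrative, not an exact hardware configuration.
    PHT_BITS = 12
    PHT_SIZE = 1 << PHT_BITS
    pht = [1] * PHT_SIZE               # initialized weakly not-taken
    bhr = 0                            # global branch history register

    def predict(pc):
        idx = (pc ^ bhr) & (PHT_SIZE - 1)   # hash PC with the context
        return pht[idx] >= 2                # counter's top bit = taken

    def update(pc, taken):
        global bhr
        idx = (pc ^ bhr) & (PHT_SIZE - 1)
        pht[idx] = min(pht[idx] + 1, 3) if taken else max(pht[idx] - 1, 0)
        bhr = ((bhr << 1) | int(taken)) & (PHT_SIZE - 1)  # shift in outcome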
gshare pros and cons
+ simple to implement!
+ variants exist in multiple real processors
- not scalable to longer history lengths:
  - the number of PHT entries grows exponentially with history length
  - learning time increases: even if the branch correlates with only one
    previous branch, 2^h PHT counters still need training (e.g., h = 15
    spreads that single correlation across 32,768 counters)
perceptron
Explicitly locate the source(s) of correlation.

[Figure: for history bits h2 h1 h0, the table-based approach must train a
separate table entry for every history pattern, while the perceptron
approach learns that the outcome is simply !h1]

  x_i = h_i ? 1 : -1
  f(X) = (0*x_0 - 1*x_1 + 0*x_2) >= 0,  with weights w_0, w_1, w_2

Weights track correlation.
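A minimal Python restatement of the slide's example (the weights and
history values come from the slide; the function is illustrative):

    # The slide's worked example: training drives the weights toward
    # w = [0, -1, 0], encoding "outcome = !h1".
    def perceptron_output(weights, history):
        xs = [1 if h else -1 for h in history]   # x_i = h_i ? 1 : -1
        return sum(w * x for w, x in zip(weights, xs))

    weights = [0, -1, 0]               # w0, w1, w2 from the slide
    history = [1, 0, 1]                # h0, h1, h2
    taken = perceptron_output(weights, history) >= 0  # f(X) >= 0 means taken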
perceptron predictor
[Figure: the PC indexes a table of weight vectors; an adder tree combines
the weights according to the BHR bits, and the sign test (>= 0) gives the
final prediction]

Updating the weights:
- if the branch outcome agrees with h_i, increment w_i
- if it disagrees, decrement w_i
The magnitude of a weight reflects the degree of correlation; no
correlation drives w_i toward 0.

Downsides:
1. latency (SRAM lookup, adder tree)
2. few entries in the table, so aliasing
3. linearly separable functions only
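A compact sketch of the whole predictor, assuming a simple PC-indexed
weight table and an always-update rule; real proposals clamp the weights
to a small signed range and train only on mispredictions or
low-confidence sums:

    HLEN, TABLE = 32, 256              # illustrative sizes
    weights = [[0] * (HLEN + 1) for _ in range(TABLE)]  # index 0 = bias
    ghist = [0] * HLEN                 # most recent outcome first

    def predict(pc):
        w = weights[pc % TABLE]        # one SRAM lookup selects all weights
        y = w[0] + sum(wi if h else -wi for wi, h in zip(w[1:], ghist))
        return y >= 0                  # the adder tree's sign test

    def update(pc, taken):
        w = weights[pc % TABLE]
        t = 1 if taken else -1
        w[0] += t                      # bias weight tracks overall bias
        for i, h in enumerate(ghist):
            w[i + 1] += t * (1 if h else -1)  # increment if outcome agrees with h_i
        ghist.pop()
        ghist.insert(0, 1 if taken else 0)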
path-based neural predictor
[Figure: weights are fetched along the path of recent PCs and summed by a
pipeline of adders; the sign test (>= 0) gives the final prediction]

Perceptron: all weights are chosen by PC_0.
PBNP: w_i is selected by PC_i (the i-th oldest PC).
- naturally leads to pipelined access
- different indexing reduces aliasing

Downsides:
1. latency (SRAM lookup, one adder)
2. complexity (30-50 stage bpred pipe)
3. linearly separable functions only
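A sketch of the key indexing difference, with the path of older PCs kept
explicitly; the real design overlaps these additions across its 30-50
pipeline stages rather than summing at prediction time, and the update
path is omitted here:

    HLEN, TABLE = 32, 256              # illustrative sizes
    weights = [[0] * TABLE for _ in range(HLEN + 1)]  # one table per position
    path = [0] * HLEN                  # PCs of the HLEN most recent branches
    ghist = [0] * HLEN                 # their outcomes

    def predict(pc):
        y = weights[0][pc % TABLE]     # bias weight, indexed by the current PC
        for i in range(HLEN):
            w = weights[i + 1][path[i] % TABLE]  # w_i chosen by the i-th oldest PC
            y += w if ghist[i] else -w # in hardware, these adds are pipelined
        return y >= 0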
piecewise-linear predictor
[Figure: m copies of the path-based adder pipeline run side by side; a mux
picks one sum, and the sign test (>= 0) gives the final prediction]

Compute m different linear functions in parallel.
Some linearly inseparable functions can be learned.

Downsides:
1. latency (SRAM lookup, one adder, one mux)
2. complexity (m copies of a 50+ stage bpred pipe)
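A sketch of the m-parallel-functions idea; using the PC's low bits as the
final mux select and these table sizes are illustrative assumptions, not
the published configuration:

    M, HLEN, TABLE = 8, 32, 64
    # m independent sets of path-indexed weights
    weights = [[[0] * TABLE for _ in range(HLEN + 1)] for _ in range(M)]
    path = [0] * HLEN
    ghist = [0] * HLEN

    def predict(pc):
        sums = []
        for m in range(M):             # m linear functions computed in parallel
            w = weights[m]
            y = w[0][pc % TABLE]
            for i in range(HLEN):
                wi = w[i + 1][path[i] % TABLE]
                y += wi if ghist[i] else -wi
            sums.append(y)
        return sums[pc % M] >= 0       # one mux, then the sign test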
goal/scope
Neural predictors are very accurate.
- we want the same level of performance
Neural predictors are complex: a large number of adders, very deep pipelines.
- we want to avoid adders
- we want to keep the pipe short
- preferable to use PHTs only
idea
[Figure: a single neural predictor trying to swallow one very long branch
history, illustrated with Kobayashi's 2004 world record of 53½ hot dogs]
idea
[Figure: the very long branch history is divided among many small
predictors whose outputs feed a meta-predictor]

Make "digesting" a very long branch history easier by dividing up the
responsibility!
unoptimized gDAC
gDAC = global-history Divide-And-Conquer

[Figure: the history is split into segments BHR[1:s1], BHR[s1+1:s2],
BHR[s2+1:s3], BHR[s3+1:s4]; each segment is hashed with the PC to index
its own gshare-styled PHT (PHT1-PHT4), and a meta-predictor selects one
per-segment prediction]

Utilizes correlation from only a single history segment.
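A sketch of this organization; the segment boundaries, sizes, and the
simplified meta table (which names a segment to trust per PC hash) are
illustrative assumptions:

    SEGS = [(0, 8), (8, 20), (20, 34), (34, 50)]   # illustrative boundaries
    PHT_BITS = 12
    MASK = (1 << PHT_BITS) - 1
    phts = [[1] * (1 << PHT_BITS) for _ in SEGS]   # 2-bit counters per segment
    meta = [0] * (1 << PHT_BITS)       # which segment to trust, per PC hash
    ghist = 0                          # very long global history as an integer

    def seg_bits(lo, hi):
        return (ghist >> lo) & ((1 << (hi - lo)) - 1)

    def predict(pc):
        preds = [phts[k][(pc ^ seg_bits(lo, hi)) & MASK] >= 2
                 for k, (lo, hi) in enumerate(SEGS)]
        k = meta[pc & MASK] % len(SEGS)            # selection: one segment only
        return preds[k]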
fusion gDAC
[Figure: as before, PHT1-PHT4 each predict from one history segment
(BHR[1:s1] ... BHR[s3+1:s4]), but their predictions and the PC now index a
fusion table that produces the final prediction]

Can combine correlations from multiple segments.
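A sketch of the fusion step: the per-segment predictions form a small bit
vector that, together with PC bits, indexes a table of counters. The exact
index hash and table size here are assumptions:

    FUSION_BITS = 10
    fusion = [1] * (1 << FUSION_BITS)  # 2-bit counters over (PC, vector) pairs

    def fused_predict(pc, seg_preds):
        vec = 0
        for p in seg_preds:                      # pack predictions into bits
            vec = (vec << 1) | int(p)
        idx = ((pc << len(seg_preds)) | vec) & ((1 << FUSION_BITS) - 1)
        return fusion[idx] >= 2                  # learned combination of segments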
gDAC
[Figure: the final design replaces each gshare-styled PHT with a Bi-Mode
style predictor (BM1-BM4) sharing a single choice PHT; the per-segment
predictions feed the fusion table]

Better per-segment predictions lead to a better final prediction.
ahead-pipelined gDAC
Cycle t-3: initial hashing and PHT bank select, using PC_-3 and the
history segments (branch history from cycles t, t-1, and t-2 does not
exist yet).
Cycle t-2: row decode.
Cycle t-1: SRAM array access; PC_-1 drives the SRAM column-mux selection.
Cycle t: each PHT SRAM is organized to output multiple counters (think
"cache line"); the current PC selects one, now that the branch history
from cycles t, t-1, and t-2 is available.
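A sketch of the timing trick: the lookup starts with an older PC, and the
current PC only performs the cheap final select among counters the SRAM
has already read out. The 4-counter line width and the hash are
assumptions for illustration:

    LINE = 4                           # counters per PHT access (assumed)
    ROWS = 1 << 10
    pht = [[1] * LINE for _ in range(ROWS)]

    def start_lookup(pc_minus3, seg_hist):
        # cycles t-3 .. t-1: hash, bank/row select, SRAM array access,
        # using only information that already exists three cycles early
        row = (pc_minus3 ^ seg_hist) % ROWS
        return pht[row]                # a whole line of counters comes out

    def finish_lookup(line, pc):
        # cycle t: the just-arrived PC picks one counter from the line
        return line[pc % LINE] >= 2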
comment on ahead-pipelining
- branch predictors composed of only PHTs are simple SRAMs, easily
  ahead-pipelined
- Seznec showed ahead-pipelining of 2Bc-gskew [ISCA'02], and of fetch in
  general [ISCA'03]
- Jiménez showed the ahead-pipelining-like gshare.fast [HPCA'03]
simulation/configs
Standard stuff:
- SimpleScalar/Alpha (MASE), SPEC2k-INT, SimPoint
- CPU config similar to the PWL study [Jiménez/ISCA'05]
- gDAC vs. gshare, perceptron, PBNP, PWL
gDAC configs vary:
- 2-3 segments
- history length from 21 @ 2KB to 86 @ 128KB
Neural advantage: gDAC tables are constrained to power-of-two entries,
while the neural predictors can use arbitrary sizes.
misprediction rates
- 2KB: about as accurate as the original perceptron
- 8KB: beats the original perceptron
- 32KB: as accurate as the path-based neural predictor
- the piecewise-linear predictor just does really well
performance
- as accurate as the perceptron, but better latency, so higher IPC
- gDAC is less accurate than path-based neural @ 16KB, but latency is
  starting to matter
- the latency difference lets gDAC catch up with even PWL in performance
Goal achieved: neural-class performance, PHT-only complexity.
so it works, but why?
- correlation locality
- correlation redundancy
- correlation recovery
Use the perceptron as the vehicle of analysis: it explicitly assigns a
correlation strength to each branch.
correlation locality
[Figure: perceptron weight magnitudes across history positions for parser]
Distinct clusters/bands of correlation.
Segmenting (at the right places) should not disrupt clusters of correlation.
correlation locality
[Figure: the same correlation-locality plot for gcc]
correlation redundancy
Using only the correlation from a few branches yields almost as much
information as using all branches. Therefore the correlations detected in
the other weights are redundant!
correlation recovery
Cross-segment correlation may exist.

[Figure: a selection tree (meta M_2,3 choosing between P_2 and P_3, then
M_1,(2,3) choosing between that and P_1) vs. a fusion table that combines
P_1, P_2, and P_3 directly]

Selection-based meta can only use correlation from one segment.
Fusion can (indirectly) use correlation from all segments.
Fusion gDAC beats selection gDAC by 4%.
orthogonality
Could use these ideas in other predictors:
- segmented-history PPM predictor
- segmented, geometric history lengths
- some "segments" could use local history, prophet "future" history, or
  anything else
There may be other ways to exploit the general phenomena of correlation
locality, redundancy, and recovery.
summary
Contributions:
- a PHT-based long-history predictor that achieves the goal of neural
  accuracy with PHT complexity
- an ahead-pipelined organization
- analysis of the effect of segmentation + fusion on correlation
Contact: loh@cc.gatech.edu, http://www.cc.gatech.edu/~loh
BACKUP SLIDES
Power
- neural predictor update: lots of separate small tables, extra decoders,
  harder to bank
- all of the adders:
  - timing-critical for the perceptron, so power hungry
  - not as bad for PBNP (can use small ripple-carry adders)
  - PWL multiplies the number of adders considerably
- checkpointing overhead for PBNP and PWL: need to store 30-50+ partial
  sums, per branch!
Power Density/Thermals
gDAC can break up its tables between prediction bits and hysteresis bits
(like EV8); a neural predictor must use all bits together.

[Figure: pipeline (Fetch, Decode, Rename, ..., Commit) with the prediction
and hysteresis tables physically separated]

Physical separation reduces power density/thermals.
Similar for O-GEHL, PPM.
linear (in)separability
[Figure: example branch functions ranked by difficulty: linearly separable
only; linearly separable between segments; linearly separable within
segments; linearly inseparable. One case is annotated "this does the best"]
per-benchmark accuracy (128KB)
[Figure: per-benchmark misprediction rates at 128KB]