A Simple Divide-and-Conquer Approach for Neural-Class Branch Prediction
Gabriel H. Loh
College of Computing
Georgia Tech
2005 Sep 20
aren't we done with branch predictors yet?
Branch predictors are still important:
- performance for large instruction windows
  e.g., CPR [Akkary et al./MICRO'03], CFP [Srinivasan et al./ASPLOS'04]
- power: better branch prediction reduces wrong-path instructions
- throughput: wrong-path instructions steal resources from other threads in SMT/SOEMT
recent bpred research
- "neural-inspired" predictors: perceptron, piecewise-linear, O-GEHL, ...
- very high accuracy
- relatively high complexity: a barrier to industrial adoption
outline
- quick synopsis of neural techniques
- gDAC predictor
  - idea
  - specifics
  - ahead-pipelining
  - results
- why gDAC works
gshare
Records previous outcomes given a branch identifier (PC) and a context (BHR).
Different contexts may lead to different predictions for the same branch.
Assumes correlation between the context and the outcome.

[Figure: the PC of branch "foobar" is hashed with the branch history
register (BHR) to index the Pattern History Table (PHT) of two-bit
counters, producing a taken/not-taken prediction]
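To make the mechanism concrete, a minimal Python sketch of a gshare lookup
and update follows; the XOR hash and 2-bit saturating counters are the
standard choices, but the exact sizing is an illustrative assumption:

    # 2-bit saturating counters indexed by PC XOR history; sizes are
    # illustrative, not an exact hardware configuration.
    PHT_BITS = 12
    PHT_SIZE = 1 << PHT_BITS
    pht = [1] * PHT_SIZE               # initialized weakly not-taken
    bhr = 0                            # global branch history register

    def predict(pc):
        idx = (pc ^ bhr) & (PHT_SIZE - 1)   # hash PC with the context
        return pht[idx] >= 2                # counter's top bit = taken

    def update(pc, taken):
        global bhr
        idx = (pc ^ bhr) & (PHT_SIZE - 1)
        pht[idx] = min(pht[idx] + 1, 3) if taken else max(pht[idx] - 1, 0)
        bhr = ((bhr << 1) | int(taken)) & (PHT_SIZE - 1)  # shift in outcome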
gshare pros and cons
+ simple to implement!
+ variants exist in multiple real processors
- not scalable to longer history lengths:
  - the number of PHT entries grows exponentially with history length
  - learning time increases: even if the branch correlates with only one
    previous branch, 2^h PHT counters still need training (e.g., h = 15
    spreads that single correlation across 32,768 counters)
perceptron
Explicitly locate the source(s) of correlation.

[Figure: for history bits h2 h1 h0, the table-based approach must train a
separate table entry for every history pattern, while the perceptron
approach learns that the outcome is simply !h1]

  x_i = h_i ? 1 : -1
  f(X) = (0*x_0 - 1*x_1 + 0*x_2) >= 0,  with weights w_0, w_1, w_2

Weights track correlation.
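A minimal Python restatement of the slide's example (the weights and
history values come from the slide; the function is illustrative):

    # The slide's worked example: training drives the weights toward
    # w = [0, -1, 0], encoding "outcome = !h1".
    def perceptron_output(weights, history):
        xs = [1 if h else -1 for h in history]   # x_i = h_i ? 1 : -1
        return sum(w * x for w, x in zip(weights, xs))

    weights = [0, -1, 0]               # w0, w1, w2 from the slide
    history = [1, 0, 1]                # h0, h1, h2
    taken = perceptron_output(weights, history) >= 0  # f(X) >= 0 means taken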
perceptron predictor
[Figure: the PC indexes a table of weight vectors; an adder tree combines
the weights according to the BHR bits, and the sign test (>= 0) gives the
final prediction]

Updating the weights:
- if the branch outcome agrees with h_i, increment w_i
- if it disagrees, decrement w_i
The magnitude of a weight reflects the degree of correlation; no
correlation drives w_i toward 0.

Downsides:
1. latency (SRAM lookup, adder tree)
2. few entries in the table, so aliasing
3. linearly separable functions only
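A compact sketch of the whole predictor, assuming a simple PC-indexed
weight table and an always-update rule; real proposals clamp the weights
to a small signed range and train only on mispredictions or
low-confidence sums:

    HLEN, TABLE = 32, 256              # illustrative sizes
    weights = [[0] * (HLEN + 1) for _ in range(TABLE)]  # index 0 = bias
    ghist = [0] * HLEN                 # most recent outcome first

    def predict(pc):
        w = weights[pc % TABLE]        # one SRAM lookup selects all weights
        y = w[0] + sum(wi if h else -wi for wi, h in zip(w[1:], ghist))
        return y >= 0                  # the adder tree's sign test

    def update(pc, taken):
        w = weights[pc % TABLE]
        t = 1 if taken else -1
        w[0] += t                      # bias weight tracks overall bias
        for i, h in enumerate(ghist):
            w[i + 1] += t * (1 if h else -1)  # increment if outcome agrees with h_i
        ghist.pop()
        ghist.insert(0, 1 if taken else 0)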
path-based neural predictor
[Figure: weights are fetched along the path of recent PCs and summed by a
pipeline of adders; the sign test (>= 0) gives the final prediction]

Perceptron: all weights are chosen by PC_0.
PBNP: w_i is selected by PC_i (the i-th oldest PC).
- naturally leads to pipelined access
- different indexing reduces aliasing

Downsides:
1. latency (SRAM lookup, one adder)
2. complexity (30-50 stage bpred pipe)
3. linearly separable functions only
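A sketch of the key indexing difference, with the path of older PCs kept
explicitly; the real design overlaps these additions across its 30-50
pipeline stages rather than summing at prediction time, and the update
path is omitted here:

    HLEN, TABLE = 32, 256              # illustrative sizes
    weights = [[0] * TABLE for _ in range(HLEN + 1)]  # one table per position
    path = [0] * HLEN                  # PCs of the HLEN most recent branches
    ghist = [0] * HLEN                 # their outcomes

    def predict(pc):
        y = weights[0][pc % TABLE]     # bias weight, indexed by the current PC
        for i in range(HLEN):
            w = weights[i + 1][path[i] % TABLE]  # w_i chosen by the i-th oldest PC
            y += w if ghist[i] else -w # in hardware, these adds are pipelined
        return y >= 0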
piecewise-linear predictor
[Figure: m copies of the path-based adder pipeline run side by side; a mux
picks one sum, and the sign test (>= 0) gives the final prediction]

Compute m different linear functions in parallel.
Some linearly inseparable functions can be learned.

Downsides:
1. latency (SRAM lookup, one adder, one mux)
2. complexity (m copies of a 50+ stage bpred pipe)
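A sketch of the m-parallel-functions idea; using the PC's low bits as the
final mux select and these table sizes are illustrative assumptions, not
the published configuration:

    M, HLEN, TABLE = 8, 32, 64
    # m independent sets of path-indexed weights
    weights = [[[0] * TABLE for _ in range(HLEN + 1)] for _ in range(M)]
    path = [0] * HLEN
    ghist = [0] * HLEN

    def predict(pc):
        sums = []
        for m in range(M):             # m linear functions computed in parallel
            w = weights[m]
            y = w[0][pc % TABLE]
            for i in range(HLEN):
                wi = w[i + 1][path[i] % TABLE]
                y += wi if ghist[i] else -wi
            sums.append(y)
        return sums[pc % M] >= 0       # one mux, then the sign test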
goal/scope
Neural predictors are very accurate.
- we want the same level of performance
Neural predictors are complex: a large number of adders, very deep pipelines.
- we want to avoid adders
- we want to keep the pipe short
- preferable to use PHTs only
idea
[Figure: a single neural predictor trying to swallow one very long branch
history, illustrated with Kobayashi's 2004 world record of 53½ hot dogs]
idea
[Figure: the very long branch history is divided among many small
predictors whose outputs feed a meta-predictor]

Make "digesting" a very long branch history easier by dividing up the
responsibility!
unoptimized gDAC
gDAC = global-history Divide-And-Conquer

[Figure: the history is split into segments BHR[1:s1], BHR[s1+1:s2],
BHR[s2+1:s3], BHR[s3+1:s4]; each segment is hashed with the PC to index
its own gshare-styled PHT (PHT1-PHT4), and a meta-predictor selects one
per-segment prediction]

Utilizes correlation from only a single history segment.
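A sketch of this organization; the segment boundaries, sizes, and the
simplified meta table (which names a segment to trust per PC hash) are
illustrative assumptions:

    SEGS = [(0, 8), (8, 20), (20, 34), (34, 50)]   # illustrative boundaries
    PHT_BITS = 12
    MASK = (1 << PHT_BITS) - 1
    phts = [[1] * (1 << PHT_BITS) for _ in SEGS]   # 2-bit counters per segment
    meta = [0] * (1 << PHT_BITS)       # which segment to trust, per PC hash
    ghist = 0                          # very long global history as an integer

    def seg_bits(lo, hi):
        return (ghist >> lo) & ((1 << (hi - lo)) - 1)

    def predict(pc):
        preds = [phts[k][(pc ^ seg_bits(lo, hi)) & MASK] >= 2
                 for k, (lo, hi) in enumerate(SEGS)]
        k = meta[pc & MASK] % len(SEGS)            # selection: one segment only
        return preds[k]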
fusion gDAC
[Figure: as before, PHT1-PHT4 each predict from one history segment
(BHR[1:s1] ... BHR[s3+1:s4]), but their predictions and the PC now index a
fusion table that produces the final prediction]

Can combine correlations from multiple segments.
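A sketch of the fusion step: the per-segment predictions form a small bit
vector that, together with PC bits, indexes a table of counters. The exact
index hash and table size here are assumptions:

    FUSION_BITS = 10
    fusion = [1] * (1 << FUSION_BITS)  # 2-bit counters over (PC, vector) pairs

    def fused_predict(pc, seg_preds):
        vec = 0
        for p in seg_preds:                      # pack predictions into bits
            vec = (vec << 1) | int(p)
        idx = ((pc << len(seg_preds)) | vec) & ((1 << FUSION_BITS) - 1)
        return fusion[idx] >= 2                  # learned combination of segments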
gDAC
[Figure: the final design replaces each gshare-styled PHT with a Bi-Mode
style predictor (BM1-BM4) sharing a single choice PHT; the per-segment
predictions feed the fusion table]

Better per-segment predictions lead to a better final prediction.
ahead-pipelined gDAC
Cycle t-3: initial hashing and PHT bank select, using PC_-3 and the
history segments (branch history from cycles t, t-1, and t-2 does not
exist yet).
Cycle t-2: row decode.
Cycle t-1: SRAM array access; PC_-1 drives the SRAM column-mux selection.
Cycle t: each PHT SRAM is organized to output multiple counters (think
"cache line"); the current PC selects one, now that the branch history
from cycles t, t-1, and t-2 is available.
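A sketch of the timing trick: the lookup starts with an older PC, and the
current PC only performs the cheap final select among counters the SRAM
has already read out. The 4-counter line width and the hash are
assumptions for illustration:

    LINE = 4                           # counters per PHT access (assumed)
    ROWS = 1 << 10
    pht = [[1] * LINE for _ in range(ROWS)]

    def start_lookup(pc_minus3, seg_hist):
        # cycles t-3 .. t-1: hash, bank/row select, SRAM array access,
        # using only information that already exists three cycles early
        row = (pc_minus3 ^ seg_hist) % ROWS
        return pht[row]                # a whole line of counters comes out

    def finish_lookup(line, pc):
        # cycle t: the just-arrived PC picks one counter from the line
        return line[pc % LINE] >= 2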
comment on ahead-pipelining
- branch predictors composed of only PHTs are simple SRAMs, easily
  ahead-pipelined
- Seznec showed ahead-pipelining of 2Bc-gskew [ISCA'02], and of fetch in
  general [ISCA'03]
- Jiménez showed the ahead-pipelining-like gshare.fast [HPCA'03]
simulation/configs
Standard stuff:
- SimpleScalar/Alpha (MASE), SPEC2k-INT, SimPoint
- CPU config similar to the PWL study [Jiménez/ISCA'05]
- gDAC vs. gshare, perceptron, PBNP, PWL
gDAC configs vary:
- 2-3 segments
- history length from 21 @ 2KB to 86 @ 128KB
Neural advantage: gDAC tables are constrained to power-of-two entries,
while the neural predictors can use arbitrary sizes.
misprediction rates
- 2KB: about as accurate as the original perceptron
- 8KB: beats the original perceptron
- 32KB: as accurate as the path-based neural predictor
- the piecewise-linear predictor just does really well
performance
- as accurate as the perceptron, but better latency, so higher IPC
- gDAC is less accurate than path-based neural @ 16KB, but latency is
  starting to matter
- the latency difference lets gDAC catch up with even PWL in performance
Goal achieved: neural-class performance, PHT-only complexity.
so it works, but why?
- correlation locality
- correlation redundancy
- correlation recovery
Use the perceptron as the vehicle of analysis: it explicitly assigns a
correlation strength to each branch.
correlation locality
[Figure: perceptron weight magnitudes across history positions for parser]
Distinct clusters/bands of correlation.
Segmenting (at the right places) should not disrupt clusters of correlation.
correlation locality
[Figure: the same correlation-locality plot for gcc]
correlation redundancy
Using only the correlation from a few branches yields almost as much
information as using all branches. Therefore the correlations detected in
the other weights are redundant!
correlation recovery
Cross-segment correlation may exist.

[Figure: a selection tree (meta M_2,3 choosing between P_2 and P_3, then
M_1,(2,3) choosing between that and P_1) vs. a fusion table that combines
P_1, P_2, and P_3 directly]

Selection-based meta can only use correlation from one segment.
Fusion can (indirectly) use correlation from all segments.
Fusion gDAC beats selection gDAC by 4%.
orthogonality
Could use these ideas in other predictors:
- segmented-history PPM predictor
- segmented, geometric history lengths
- some "segments" could use local history, prophet "future" history, or
  anything else
There may be other ways to exploit the general phenomena of correlation
locality, redundancy, and recovery.
summary
Contributions:
- a PHT-based long-history predictor that achieves the goal of neural
  accuracy with PHT complexity
- an ahead-pipelined organization
- analysis of the effect of segmentation + fusion on correlation
Contact: loh@cc.gatech.edu, http://www.cc.gatech.edu/~loh
BACKUP SLIDES
Power
- neural predictor update: lots of separate small tables, extra decoders,
  harder to bank
- all of the adders:
  - timing-critical for the perceptron, so power hungry
  - not as bad for PBNP (can use small ripple-carry adders)
  - PWL multiplies the number of adders considerably
- checkpointing overhead for PBNP and PWL: need to store 30-50+ partial
  sums, per branch!
Power Density/Thermals
gDAC can break up its tables between prediction bits and hysteresis bits
(like EV8); a neural predictor must use all bits together.

[Figure: pipeline (Fetch, Decode, Rename, ..., Commit) with the prediction
and hysteresis tables physically separated]

Physical separation reduces power density/thermals.
Similar for O-GEHL, PPM.
linear (in)separability
[Figure: example branch functions ranked by difficulty: linearly separable
only; linearly separable between segments; linearly separable within
segments; linearly inseparable. One case is annotated "this does the best"]
per-benchmark accuracy (128KB)
[Figure: per-benchmark misprediction rates at 128KB]