## Advanced Computer Architecture

## 6. Instruction Level Parallelism: Instruction Fetch and Branch Prediction

www.arch.cs.titech.ac.jp/lecture/ACA/ Room No. W834, Lecture (face-to-face): Mon 13:30-15:10, Thu 13:30-15:10

Kenji Kise, Department of Computer Science kise_at_c.titech.ac.jp

## Exploiting Instruction Level Parallelism (ILP)

- A superscalar processor has to handle three flows efficiently to exploit ILP
  - Control flow (control dependence)
    - To execute $n$ instructions per clock cycle, the processor has to fetch at least $n$ instructions per cycle.
    - The main obstacle is branch instructions (BNE)
    - Prediction
    - Another obstacle is instruction cache
  - Register data flow (data dependence)
    - Out-of-order execution
    - Register renaming
    - Dynamic scheduling
  - Memory data flow
    - Out-of-order execution
    - Another obstacle is instruction cache

Example instruction sequence:

(1) `add x5, x1, x2`
(2) `add x9, x5, x3`
(3) `lw x4, 4(x7)`
(4) `add x8, x9, x4`


## Branch predictor

- A branch predictor is a digital circuit that tries to guess or predict which way (taken or untaken) a branch will go before this is known definitively.
- A random predictor will achieve about a $50 \%$ hit rate because the prediction output is 1 or 0 .
- Let's guess the accuracy. What is the accuracy of typical branch predictors for high-performance commercial processors?


## Sample program: vector add (function v_add)

```
#define VSIZE 4
void v_add(int *A, int *B, int *C){
    for (int i = 0; i < VSIZE; i++)
        C[i] += (A[i] + B[i]);
}
```

A basic block is a sequence of statements. The flow of control enters only at the beginning of the block and leaves only at the end.


## Simple branch predictor: 2-bit counter (2BC)

- It uses a two-bit register as a saturating counter.
- How to update the register
- If the branch outcome is taken and the value is not 3, then increment the register.
- If the branch outcome is untaken and the value is not 0, then decrement the register.
- How to predict
- It predicts 1 (taken) if the MSB of the register is one; otherwise, it predicts 0 (untaken).


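Predicting the sequence 11101110111011101110 ...

| | Values |
| :--- | :--- |
| Branch outcome | 1110 1110 1110 1110 1110 ... |
| State of the counter | 2333 2333 2333 2333 2333 ... |
| Prediction | 1111 1111 1111 1111 1111 ... |
| Hit/Miss of the prediction | HHHM HHHM HHHM HHHM HHHM ... |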
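The update and prediction rules of the 2-bit counter can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the lecture's implementation: the function names, the `'0'`/`'1'` string encoding of branch outcomes, and the explicit `initial` counter value are all chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Update a 2-bit saturating counter (0..3) with a branch outcome (1 = taken). */
int counter_update(int counter, int taken)
{
    if (taken && counter < 3) return counter + 1;
    if (!taken && counter > 0) return counter - 1;
    return counter;  /* saturated: stay at 0 or 3 */
}

/* Predict taken (1) when the MSB of the 2-bit counter is set. */
int counter_predict(int counter)
{
    return (counter >> 1) & 1;
}

/* Replay a string of '0'/'1' outcomes and return the number of hits.
 * `initial` is the assumed starting value of the counter. */
int count_hits_2bc(const char *outcomes, int initial)
{
    assert(initial >= 0 && initial <= 3);
    int counter = initial;
    int hits = 0;
    for (size_t i = 0; outcomes[i] != '\0'; i++) {
        int taken = (outcomes[i] == '1');
        if (counter_predict(counter) == taken) hits++;
        counter = counter_update(counter, taken);
    }
    return hits;
}
```

Calling `count_hits_2bc("11101110111011101110", 2)` returns 15 hits out of 20, which is the 75% hit rate of the HHHM pattern.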

## Sample program: vector add with two branches

```
#define VSIZE 4
void v_add(int *A, int *B, int *C){
    for (int i = 0; i < VSIZE; i++) {
        if(A[i]<0) error_routine();
        C[i] += (A[i] + B[i]);
    }
}
```

A basic block is a sequence of statements. The flow of control enters only at the beginning of the block and leaves only at the end.

Figure: control flow graph of `v_add` with two branches.

Predicting the branch outcome sequence 010101000101010001010100 ...

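## Sample program: vector add with two branches

Predicting the branch outcome sequence 010101000101010001010100 ...

The sequence interleaves the outcomes of the two static branches, labeled B2 and B4 in the control flow graph. In each iteration the branch for `if (A[i] < 0)` executes first and the loop branch second. One branch therefore contributes the 1st, 3rd, 5th, ... outcomes (0 0 0 0 ..., as long as no element of `A` is negative), and the other contributes the 2nd, 4th, 6th, ... outcomes (1 1 1 0 1 1 1 0 ...).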

## Simple branch predictor: bimodal

- A program has many static branch instructions, and the behavior may differ from branch to branch. Use plenty of counters, organized as a pattern history table (PHT), and assign a counter to each branch instruction.
- How to predict
- Select a 2-bit counter using the PC; it predicts 1 (taken) if the MSB of the counter is one, otherwise 0 (untaken).
- How to update
- Select a counter using the PC, then update it in the same way as the 2-bit counter (2BC).
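A bimodal predictor along these lines can be sketched in C as below. The table size, the `(pc >> 2)` index function, the initial counter value of 2, and all names are assumptions made for this sketch. The helper `bimodal_hits_two_branches` replays an interleaved outcome string in which even positions belong to one branch and odd positions to another.

```c
#include <assert.h>
#include <stdint.h>

#define PHT_ENTRIES 1024  /* assumed PHT size (power of two) */

typedef struct {
    uint8_t counter[PHT_ENTRIES];  /* 2-bit saturating counters, values 0..3 */
} Bimodal;

void bimodal_init(Bimodal *p)
{
    for (int i = 0; i < PHT_ENTRIES; i++) p->counter[i] = 2;  /* assumed start */
}

/* Select a counter using the PC (assumes 4-byte instructions). */
unsigned bimodal_index(uint32_t pc)
{
    return (pc >> 2) & (PHT_ENTRIES - 1);
}

/* The MSB of the selected counter is the prediction. */
int bimodal_predict(const Bimodal *p, uint32_t pc)
{
    return (p->counter[bimodal_index(pc)] >> 1) & 1;
}

/* Update the selected counter in the same way as a single 2-bit counter. */
void bimodal_update(Bimodal *p, uint32_t pc, int taken)
{
    uint8_t *c = &p->counter[bimodal_index(pc)];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

/* Replay interleaved outcomes of two branches and return the hit count. */
int bimodal_hits_two_branches(const char *outcomes,
                              uint32_t pc_even, uint32_t pc_odd)
{
    /* The two branches must map to different counters. */
    assert(bimodal_index(pc_even) != bimodal_index(pc_odd));
    Bimodal p;
    bimodal_init(&p);
    int hits = 0;
    for (int i = 0; outcomes[i] != '\0'; i++) {
        uint32_t pc = (i % 2 == 0) ? pc_even : pc_odd;
        int taken = (outcomes[i] == '1');
        if (bimodal_predict(&p, pc) == taken) hits++;
        bimodal_update(&p, pc, taken);
    }
    return hits;
}
```

For the sequence 010101000101010001010100 and two non-aliasing addresses, the sketch scores 20 hits out of 24: one miss while the counter of the all-zero branch warms up, and one miss at each of the three loop exits.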



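## Simple branch predictor: bimodal

Predicting the sequence 010101000101010001010100 ... with one counter for each of the two branches (B2 and B4).

Branch whose outcomes are all 0 (every other position of the sequence):

| | Values |
| :--- | :--- |
| Branch outcome | 0 0 0 0 0 0 0 0 0 0 0 0 ... |
| State of the counter | 2 1 0 0 0 0 0 0 0 0 0 0 ... |
| Prediction | 1 0 0 0 0 0 0 0 0 0 0 0 ... |
| Hit/Miss of the prediction | M H H H H H H H H H H H ... |

Branch whose outcomes repeat 1 1 1 0 (the remaining positions):

| | Values |
| :--- | :--- |
| Branch outcome | 1 1 1 0 1 1 1 0 1 1 1 0 ... |
| State of the counter | 2 3 3 3 2 3 3 3 2 3 3 3 ... |
| Prediction | 1 1 1 1 1 1 1 1 1 1 1 1 ... |
| Hit/Miss of the prediction | H H H M H H H M H H H M ... |

Figure: each branch instruction of the program selects its own counter in the Pattern History Table (PHT).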


## Accuracy of simple predictors with 8KB HW budget



Benchmark for CBP(2004) by Intel MRL and IEEE TC uARCH.

## 5-stage pipelining RISC-V processor with data forwarding

- The strategy is to separate instruction fetch step (IF), instruction decode step (ID), execution step (EX), memory access step (MA), and write back step (WB).
- Use the pipeline registers P1, P2, P3, P4.



## Why do branch instructions degrade IPC?

- The branch taken / untaken is determined in the execution (EX) stage of the branch.
- The conservative approach is stalling instruction fetch until the branch direction is determined.
- It is too conservative to be practical.

| No. | Instr. | cc1 | cc2 | cc3 | cc4 | cc5 | cc6 | cc7 | cc8 | cc9 | cc10 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | add | IF | ID | EX | MEM | WB |  |  |  |  |  |
| 2 | add | IF | ID | EX | MEM | WB |  |  |  |  |  |
| 3 | bne |  | IF | ID | EX | MEM | WB |  |  |  |  |
| 4 | add |  |  |  |  | IF | ID | EX | MEM | WB |  |
| 5 | add |  |  |  |  | IF | ID | EX | MEM | WB |  |
| 6 | add |  |  |  |  |  | IF | ID | EX | MEM | WB |
| 7 | add |  |  |  |  |  | IF | ID | EX | MEM | WB |

2-way superscalar processor executing an instruction sequence with a branch. Because of the control dependency, instruction 4 is not fetched until cc5, after the bne has resolved in its EX stage (cc4).

## Why do branch instructions degrade IPC?

- The branch taken / untaken is determined in the execution (EX) stage of the branch.
- Prediction and speculation, then training
- Recovery when a prediction misses

| No. | Instr. | cc1 | cc2 | cc3 | cc4 | cc5 | cc6 | cc7 | cc8 | cc9 | cc10 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | add | IF | ID | EX | MEM | WB |  |  |  |  |  |
| 2 | add | IF | ID | EX | MEM | WB |  |  |  |  |  |
| 3 | bne |  | IF | ID | EX | MEM | WB |  |  |  |  |
| 4 | add |  |  | IF | ID | EX | MEM | WB |  |  |  |
| 5 | add |  |  | IF | ID | EX | MEM | WB |  |  |  |
| 6 | add |  |  |  | IF | ID | EX | MEM | WB |  |  |
| 7 | add |  |  |  | IF | ID | EX | MEM | WB |  |  |

2-way superscalar processor executing an instruction sequence with a branch

Speculative execution performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing the work after it is known that it is needed.

## Why do branch instructions degrade IPC?

- The branch taken / untaken is determined in the execution (EX) stage of the branch.
- Prediction and speculation, then training
- Recovery when a prediction misses
- If the prediction turns out to be wrong, the speculatively computed results are discarded and the changes made by the speculative execution are undone.



## Instruction fetch unit of 2-way super-scalar

- High-bandwidth instruction delivery using prediction and speculation

Figure: the instruction fetch unit in the IF stage, followed by the ID, EX, MA, and WB stages.


## An innovation in branch predictors in 1993

- Using branch history
- global branch history
- local branch history
- 2-level branch predictor and gshare
- Assume predicting the sequence 11101110111011101110 ...


Use the recent branch history as an address of a table.
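Using the recent branch history as a table address can be sketched in C as below. This is only an illustration under assumptions not stated on the slide: a 3-bit history, a table of 2-bit counters initialized to 2, a history that starts as all zeros, and the function name `history_table_hits`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HIST_BITS 3                    /* assumed history length */
#define TABLE_SIZE (1u << HIST_BITS)   /* one counter per history pattern */

/* Replay a string of '0'/'1' outcomes; the last HIST_BITS outcomes
 * select the 2-bit counter that makes the prediction. */
int history_table_hits(const char *outcomes)
{
    assert(outcomes != NULL);
    uint8_t counter[TABLE_SIZE];
    unsigned history = 0;              /* assumed initial history: all zeros */
    int hits = 0;

    for (unsigned i = 0; i < TABLE_SIZE; i++) counter[i] = 2;

    for (size_t i = 0; outcomes[i] != '\0'; i++) {
        int taken = (outcomes[i] == '1');
        uint8_t *c = &counter[history];            /* history is the address */
        if (((*c >> 1) & 1) == taken) hits++;      /* MSB is the prediction */
        if (taken && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        /* Shift the newest outcome into the history register. */
        history = ((history << 1) | (unsigned)taken) & (TABLE_SIZE - 1);
    }
    return hits;
}
```

On 11101110111011101110 the sketch misses only the first 0 (19 hits out of 20): once the counter addressed by history 111 has seen a not-taken outcome, every position of the repeating pattern maps to its own counter.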

## Recommended Reading

- Combining Branch Predictors
- Scott McFarling, Digital Western Research Laboratory
- WRL Technical Note TN-36, 1993
- A quote:
"In this paper, we have presented two new methods for improving branch prediction performance. First, we showed that using the bitwise exclusive OR of the global branch history and the branch address to access predictor counters results in better performance for a given counter array size."


## Gshare (TR-DEC 1993)

- How to predict
- Use the exclusive OR of the global branch history and the PC to index the PHT; the MSB of the selected counter is the prediction.
- How to update
- Shift the branch history register (BHR) one bit to the left and set its LSB to the branch outcome in the IF stage.
- Update the used counter in the same way as 2BC in the WB stage.
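The gshare indexing and update steps can be sketched in C as follows. The sketch is trace-driven, so it updates the counter and the history together using the actual outcome rather than splitting the work between the IF and WB stages. The 10-bit history, the table size, the initial values, and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GSHARE_BITS 10                     /* assumed history / index width */
#define GSHARE_SIZE (1u << GSHARE_BITS)

typedef struct {
    uint8_t pht[GSHARE_SIZE];  /* 2-bit saturating counters */
    uint32_t bhr;              /* global branch history register */
} Gshare;

void gshare_init(Gshare *g)
{
    for (unsigned i = 0; i < GSHARE_SIZE; i++) g->pht[i] = 2;
    g->bhr = 0;
}

/* XOR of the global history and the PC selects the counter. */
unsigned gshare_index(const Gshare *g, uint32_t pc)
{
    return ((pc >> 2) ^ g->bhr) & (GSHARE_SIZE - 1);
}

int gshare_predict(const Gshare *g, uint32_t pc)
{
    return (g->pht[gshare_index(g, pc)] >> 1) & 1;
}

void gshare_update(Gshare *g, uint32_t pc, int taken)
{
    /* Update the counter that was used, i.e. index with the old history. */
    uint8_t *c = &g->pht[gshare_index(g, pc)];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    /* Shift the BHR left by one and put the outcome in the LSB. */
    g->bhr = ((g->bhr << 1) | (uint32_t)(taken != 0)) & (GSHARE_SIZE - 1);
}

/* Replay the outcomes of a single branch at `pc` and return the hit count. */
int gshare_hits(const char *outcomes, uint32_t pc)
{
    assert(outcomes != NULL);
    Gshare g;
    gshare_init(&g);
    int hits = 0;
    for (size_t i = 0; outcomes[i] != '\0'; i++) {
        int taken = (outcomes[i] == '1');
        if (gshare_predict(&g, pc) == taken) hits++;
        gshare_update(&g, pc, taken);
    }
    return hits;
}
```

With these assumptions, `gshare_hits("11101110111011101110", 0x400)` returns 17: the first three 0 outcomes are mispredicted while the 10-bit history fills, and from then on the repeating histories select trained counters.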



## Bi-Mode (MICRO 1997)

- A choice predictor (bimodal) is used as a meta-predictor
- How to predict
- As in gshare, the Taken PHT and the Untaken PHT each make a prediction.
- Select one of the two with the choice predictor, which tracks the global bias of a branch.
- How to update
- The used PHT is updated in the same way as 2BC.
- The choice predictor is updated in the same way as bimodal.
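The structure described above can be sketched in C as follows. It follows the simplified update rule stated in the bullets, and everything else is an assumption of this sketch: the table sizes, the index functions, the initial counter values (choice and Taken PHT start at 2, Untaken PHT at 1), and the names.

```c
#include <assert.h>
#include <stdint.h>

#define BM_BITS 10                  /* assumed index width for all tables */
#define BM_SIZE (1u << BM_BITS)

typedef struct {
    uint8_t choice[BM_SIZE];        /* bimodal choice predictor, indexed by PC */
    uint8_t taken_pht[BM_SIZE];     /* direction PHT, indexed like gshare */
    uint8_t untaken_pht[BM_SIZE];   /* direction PHT, indexed like gshare */
    uint32_t bhr;                   /* global branch history register */
} BiMode;

/* 2-bit saturating counter update, as in 2BC. */
uint8_t bm_sat_update(uint8_t c, int taken)
{
    assert(c <= 3);
    if (taken && c < 3) return (uint8_t)(c + 1);
    if (!taken && c > 0) return (uint8_t)(c - 1);
    return c;
}

void bimode_init(BiMode *b)
{
    for (unsigned i = 0; i < BM_SIZE; i++) {
        b->choice[i] = 2;           /* assumed initial values */
        b->taken_pht[i] = 2;
        b->untaken_pht[i] = 1;
    }
    b->bhr = 0;
}

int bimode_predict(const BiMode *b, uint32_t pc)
{
    unsigned ci = (pc >> 2) & (BM_SIZE - 1);             /* choice index */
    unsigned di = ((pc >> 2) ^ b->bhr) & (BM_SIZE - 1);  /* direction index */
    int prefer_taken = (b->choice[ci] >> 1) & 1;
    uint8_t c = prefer_taken ? b->taken_pht[di] : b->untaken_pht[di];
    return (c >> 1) & 1;
}

void bimode_update(BiMode *b, uint32_t pc, int taken)
{
    unsigned ci = (pc >> 2) & (BM_SIZE - 1);
    unsigned di = ((pc >> 2) ^ b->bhr) & (BM_SIZE - 1);
    int prefer_taken = (b->choice[ci] >> 1) & 1;
    /* Only the PHT that supplied the prediction is updated. */
    uint8_t *used = prefer_taken ? &b->taken_pht[di] : &b->untaken_pht[di];
    *used = bm_sat_update(*used, taken);
    /* The choice predictor is updated like a bimodal counter. */
    b->choice[ci] = bm_sat_update(b->choice[ci], taken);
    b->bhr = ((b->bhr << 1) | (uint32_t)(taken != 0)) & (BM_SIZE - 1);
}
```

A caller would invoke `bimode_predict` before each branch and `bimode_update` once its outcome is known. The intent of the split is that branches with opposite biases end up in different PHTs, so they interfere less when they share a direction index.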



## To go beyond gshare

- Using branch history
- global branch history
- local branch history
- 2-level branch predictor and gshare
- Assume predicting the sequence 11101110111011101110 ...

```
11101110 ?
111011101 ?
1110111011 ?
11101110111 ?
111011101110 ?
```


## Perceptron (HPCA 2001)

- How to predict
- Select one perceptron by PC.
- Compute $y$ using the equation below. It predicts 1 if $y \geq 0$ and 0 if $y < 0$.
- $x$ is the branch history. Each $x_{i}$ is either $-1$, meaning not taken, or $1$, meaning taken.

$$
y = w_{0} + \sum_{i=1}^{n} x_{i} w_{i}
$$

Figure: the program counter selects one entry in the table of perceptrons ($w$), and its weights are combined with the branch history ($x$). Each perceptron holds 29 weights of 8 bits (232 bits).

- How to update
- Train the weights of the used perceptron when the prediction misses or $|y| \leq T$, where $T$ is the threshold, $t$ is the branch outcome ($1$ for taken, $-1$ for not taken), and $x_{0} = 1$.

$$
\begin{aligned}
& \text{if } \operatorname{sign}(y) \neq t \text{ or } |y| \leq T \text{ then} \\
& \qquad \text{for } i := 0 \text{ to } n \text{ do} \\
& \qquad \qquad w_{i} := w_{i} + t x_{i} \\
& \qquad \text{end for} \\
& \text{end if}
\end{aligned}
$$

$$
T = 1.93 n + 14
$$
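The perceptron prediction and training rules can be sketched in C as below. This is a sketch under assumptions that the slide does not state: a history length of $n = 4$, a table of 64 perceptrons, weights that start at 0, a history that starts as all not-taken, and weights clamped to the 8-bit range. All names are hypothetical. Because $y$ is an integer, comparing against the integer part of $T$ is equivalent to comparing against $T$ itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define HIST_LEN 4                               /* n: assumed history length */
#define NUM_PERCEPTRONS 64                       /* assumed table size */
#define THRESHOLD ((int)(1.93 * HIST_LEN + 14))  /* integer part of T */

typedef struct {
    int w[NUM_PERCEPTRONS][HIST_LEN + 1];  /* w[k][0] is the bias weight w0 */
    int x[HIST_LEN + 1];                   /* x[0] = 1; x[1] is the newest outcome */
} PerceptronPredictor;

void perceptron_init(PerceptronPredictor *p)
{
    for (int k = 0; k < NUM_PERCEPTRONS; k++)
        for (int i = 0; i <= HIST_LEN; i++)
            p->w[k][i] = 0;
    p->x[0] = 1;
    for (int i = 1; i <= HIST_LEN; i++)
        p->x[i] = -1;                      /* assumed: history starts not-taken */
}

/* y = w0 + sum of x_i * w_i; the prediction is taken when y >= 0. */
int perceptron_output(const PerceptronPredictor *p, unsigned pc)
{
    const int *w = p->w[(pc >> 2) % NUM_PERCEPTRONS];
    int y = w[0];
    for (int i = 1; i <= HIST_LEN; i++)
        y += p->x[i] * w[i];
    return y;
}

void perceptron_train(PerceptronPredictor *p, unsigned pc, int taken)
{
    int *w = p->w[(pc >> 2) % NUM_PERCEPTRONS];
    int y = perceptron_output(p, pc);
    int t = taken ? 1 : -1;
    /* Train on a misprediction or when the output is not confident enough. */
    if ((y >= 0) != (taken != 0) || abs(y) <= THRESHOLD) {
        for (int i = 0; i <= HIST_LEN; i++) {
            w[i] += t * p->x[i];
            if (w[i] > 127) w[i] = 127;    /* keep weights within 8 bits */
            if (w[i] < -128) w[i] = -128;
        }
    }
    /* Shift the outcome into the history. */
    for (int i = HIST_LEN; i > 1; i--)
        p->x[i] = p->x[i - 1];
    p->x[1] = t;
}

/* Replay the outcomes of a single branch and return the number of misses. */
int perceptron_misses(const char *outcomes, unsigned pc)
{
    assert(outcomes != NULL);
    PerceptronPredictor p;
    perceptron_init(&p);
    int misses = 0;
    for (size_t i = 0; outcomes[i] != '\0'; i++) {
        int taken = (outcomes[i] == '1');
        if ((perceptron_output(&p, pc) >= 0) != taken) misses++;
        perceptron_train(&p, pc, taken);
    }
    return misses;
}
```

Under these assumptions, replaying 35 outcomes of the repeating pattern 1110 gives 4 misses, all within the first 8 branches; after that the weights have learned the pattern.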


## Exercise 1

- How to predict
- Select one perceptron by PC.
- Compute $y = w_{0} + \sum_{i=1}^{n} x_{i} w_{i}$. It predicts 1 if $y \geq 0$ and 0 if $y < 0$.
- $x$ is the branch history. Each $x_{i}$ is either $-1$, meaning not taken, or $1$, meaning taken.
- How to update
- Train the weights of the used perceptron when the prediction misses or $|y| \leq T$, where $T$ is the threshold and $t$ is the branch outcome ($1$ for taken, $-1$ for not taken).

$$
\begin{aligned}
& \text{if } \operatorname{sign}(y) \neq t \text{ or } |y| \leq T \text{ then} \\
& \qquad \text{for } i := 0 \text{ to } n \text{ do} \\
& \qquad \qquad w_{i} := w_{i} + t x_{i} \\
& \qquad \text{end for} \\
& \text{end if}
\end{aligned}
$$

$$
T = 1.93 n + 14
$$

Simulation trace for $T = 21.72$ (that is, $n = 4$); prediction and outcome for each branch:

pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=1 outcome=0 : miss
pred=0 outcome=1 : miss
pred=0 outcome=1 : miss
pred=1 outcome=1 : hit
pred=1 outcome=0 : miss
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=0 outcome=0 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=0 outcome=0 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=0 outcome=0 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=0 outcome=0 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=0 outcome=0 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=0 outcome=0 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit
pred=1 outcome=1 : hit

## Perceptron (HPCA 2001)

## The Neural Network in Your CPU

Sun, Aug 6, 2017

Machine learning and artificial intelligence are the current hype (again). In their new Ryzen processors, AMD advertises the Neural Net Prediction. It turns out this is was already used in their older (2012) Piledriver architecture used for example in the AMD A10-4600M. It is also present in recent Samsung processors such as the one powering the Galaxy S7. What is it really?

The basic idea can be traced to a paper from Daniel Jimenez and Calvin Lin "Dynamic Branch Prediction with Perceptrons", more precisely described in the subsequent paper "Neural methods for dynamic branch prediction". Branches typically occur in if-then-else statements. Branch prediction consists in guessing which code branch, the then or the else, the code will execute, thus allowing to precompute the branch in parallel for faster evaluation.

Jimenez and Lin rely on a simple single-layer perceptron neural network whose input are the branch outcome (global or hybrid local and global) histories and the output predicts which branch will be taken. In reality, because there is a single layer, ...

https://www.anandtech.com/Gallery/Album/5197#18
https://chasethedevil.github.io/post/the_neural_network_in_your_cpu/

## Branch predictors based on pattern matching

- Find the longest matching pattern (green rectangle)
- Select the proper matching length or long matching pattern (blue rectangle)
- Count the number of 0s and the number of 1s that follow the long matching patterns (red rectangle), then predict by majority vote.

Figure: the global branch history is searched for matching patterns to produce a prediction of 0 or 1.
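The matching-and-voting idea can be sketched in C as follows. This is a software illustration only, not how PPM or TAGE hardware is organized. The maximum match length `MAX_MATCH`, the string encoding of the history (oldest outcome first), the tie-break toward taken, and the default prediction are assumptions of the sketch.

```c
#include <assert.h>
#include <string.h>

#define MAX_MATCH 8   /* assumed longest pattern length to try */

/* `history` is a string of '0'/'1' outcomes, oldest first.
 * Returns the predicted next outcome (1 = taken). */
int pattern_predict(const char *history)
{
    assert(history != NULL);
    size_t len = strlen(history);

    /* Try the longest pattern first, then fall back to shorter ones. */
    for (size_t m = MAX_MATCH; m >= 1; m--) {
        if (m >= len) continue;   /* no room for an earlier occurrence */
        const char *pattern = history + len - m;  /* the most recent m outcomes */
        int ones = 0, zeros = 0;

        /* Find earlier occurrences of the pattern and count what followed. */
        for (size_t start = 0; start + m < len; start++) {
            if (memcmp(history + start, pattern, m) == 0) {
                if (history[start + m] == '1') ones++;
                else zeros++;
            }
        }
        /* Majority vote over the outcomes that followed the matches. */
        if (ones + zeros > 0) return ones >= zeros;
    }
    return 1;   /* assumed default when nothing matches: predict taken */
}
```

For example, `pattern_predict("111011101110111")` returns 0, because the only earlier occurrence of the last eight outcomes was followed by a 0, and `pattern_predict("1110111011101110")` returns 1.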


## Partial Pattern Matching, PPM or TAGE (CBP 2004)




The original launch of the 'Zen' architecture in the Ryzen 1000 series desktop processors featured clock speeds up to 4 GHz, and were manufactured on the 14 nm manufacturing node. This was followed the next year with the Ryzen 2000 series featuring updated 'Zen+' architecture, which was die-shrunk to the 12 nm node and delivered higher clock speeds with about 3\% higher IPC (instructions per clock) compared to its predecessor. Despite this modest increase, it delivered up to 15\% higher gaming performance due to updates like Precision Boost 2 and XFR 2, thanks in part to a clock speed increase up to 4.3 GHz.


The next major 'Zen' revision was 'Zen3', which debuted in AMD Ryzen 5000 series desktop processors. This comprehensive design overhaul delivered a further 19\% IPC increase thanks to over 20 major changes, which included: wider and more flexible execution resources; significantly more load/store bandwidth to feed execution; and a streamlined front-end to get more threads in flight-and do it faster. It also transitioned to a new "unified complex" design that brought 8 cores and 32MB of L3 cache into a single group of resources. This dramatically reduced core-to-core and

## Prediction accuracy

- The accuracy of a 4 KB gshare predictor is about $93\%$.
- The accuracy of a 4 KB PPM predictor is about $97\%$.



## Recommended Reading

- Prophet-Critic Hybrid Branch Prediction
- Ayose Falcon, UPC, Jared Stark, Intel, Alex Ramirez, UPC, Konrad Lai, Intel, Mateo Valero
- ISCA-31 pp. 250-261 (2004)


## A quote from Introduction (1/2)

Conventional predictors are analogous to a taxi with just one driver. He gets the passenger to the destination using knowledge of the roads acquired from previous trips; i.e., using history information stored in the predictor's memory structures.
When he reaches an intersection, he uses this knowledge to decide which way to turn.
The driver accesses this knowledge in the context of his current location.
Modern branch predictors access it in the context of the current location (the program counter) plus a history of the most recent decisions that led to the current location.

## A quote from Introduction (2/2)

Prophet/critic hybrids are analogous to a taxi with two drivers: the front-seat and the back-seat. The front-seat driver has the same role as the driver in the single-driver taxi. This role is called the prophet. The back-seat driver has the role of critic. She watches the turns the prophet makes at intersections. She doesn't say anything unless she thinks he's made a wrong turn. When she thinks he's made a wrong turn, she waits until he's made a few more turns to be certain they are lost. (Sometimes the prophet makes turns that initially look questionable, but, after he makes a few more turns, in hindsight appear to be correct.) Only when she's certain does she point out the mistake.
To recover, they backtrack to the intersection where she believes the wrong-turn was made and try a different direction.

## Prophet-Critic Hybrid Branch Prediction




Figure 5. Effect of varying the number of future bits used by the critic on prediction accuracy for selected benchmarks. (prophet: 8KB perceptron; critic: 8KB tagged gshare)

