



Fiscal Year 2019

Ver. 2020-01-05a

Course number: CSC.T433  
School of Computing,  
Graduate major in Computer Science

# Advanced Computer Architecture

## 8. Instruction Level Parallelism: Dynamic Scheduling

[www.arch.cs.titech.ac.jp/lecture/ACA/](http://www.arch.cs.titech.ac.jp/lecture/ACA/)  
Room No.W936  
Mon 13:20-14:50, Thu 13:20-14:50

Kenji Kise, Department of Computer Science  
kise\_at\_c.titech.ac.jp



# Scalar and Superscalar processors

- A scalar processor executes at most one instruction per clock cycle using one ALU.
  - IPC (Executed Instructions Per Cycle) is at most 1.
- A superscalar processor executes more than one instruction per clock cycle by running multiple instructions in multiple pipelines.
  - IPC (Executed Instructions Per Cycle) can be more than 1.
  - A processor using $n$ pipelines is called an $n$-way superscalar processor.
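To make the IPC bounds concrete, here is a minimal sketch that computes the IPC of an idealized pipeline. It assumes a 5-stage pipeline with no stalls and that an $n$-way processor starts exactly $n$ instructions every cycle; the function name `ideal_ipc` and these assumptions are illustrative and not part of the lecture.

```python
import math


def ideal_ipc(num_insns: int, ways: int = 1, stages: int = 5) -> float:
    """IPC of an idealized n-way pipeline with no stalls (an assumption)."""
    # Instructions enter the pipeline in groups of `ways`.
    groups = math.ceil(num_insns / ways)
    # The first group needs `stages` cycles; every further group adds one cycle.
    cycles = stages + groups - 1
    return num_insns / cycles


if __name__ == "__main__":
    for ways in (1, 2, 4):
        print(f"{ways}-way: IPC = {ideal_ipc(1000, ways):.2f}")
```

Under these assumptions the scalar case approaches but never exceeds 1, while the 2-way case exceeds 1 once the pipeline is full.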



(a) pipeline diagram of scalar processor



(b) pipeline diagram of 2-way superscalar processor

# Instruction fetch unit in IF stage

- For high-bandwidth instruction delivery, prediction, and speculation



# Exploiting Instruction Level parallelism (ILP)

- To exploit ILP, a superscalar processor has to handle three flows efficiently:
  - Control flow
    - To execute $n$ instructions per clock cycle, the processor has to fetch at least $n$ instructions per cycle.
    - The main obstacles are branch instructions (e.g., BNE, BEQ).
    - Another obstacle is the instruction cache.
  - Register data flow
    - Dynamic scheduling
  - Memory data flow



# Exploiting Instruction Level Parallelism (ILP)



## Prediction & speculation



Control flow graph

What is the solution?

4 cycles for 4 insns  
ILP = 1.0



3 cycles for 4 insns  
ILP = 1.33



# Exercise: what is data dependence

- Draw a data flow graph for each instruction stream

Instruction stream 1

$$R3 = R2 + 1 \quad (1)$$

$$R5 = R4 + 2 \quad (2)$$

$$R7 = R6 + 3 \quad (3)$$

Instruction stream 2

$$R3 = R2 + 1 \quad (1)$$

$$R5 = R4 + 2 \quad (2)$$

$$R7 = R3 + 3 \quad (3)$$

Instruction stream 3

$$R3 = R2 + 1 \quad (1)$$

$$R3 = R4 + 2 \quad (2)$$

$$R7 = R6 + 3 \quad (3)$$

Instruction stream 4

$$R3 = R2 + 1 \quad (1)$$

$$R5 = R4 + 2 \quad (2)$$

$$R4 = R6 + 3 \quad (3)$$



# True data dependence



- Insn i writes a register that insn j reads, **RAW** (read after write)
- Program order must be preserved to ensure insn j receives the value of insn i.

$$R3 = R3 \times R5 \quad (1)$$

$$R4 = R3 + 1 \quad (2)$$

$$R3 = R5 + 2 \quad (3)$$

$$R7 = R3 + R4 \quad (4)$$

Assume  $R3=10, R5=3$. Executed in program order:

$$30 = 10 \times 3 \quad (1)$$

$$31 = 30 + 1 \quad (2)$$

$$5 = 3 + 2 \quad (3)$$

$$36 = 5 + 31 \quad (4)$$

Executed with (4) before (3), which violates the RAW dependence on R3:

$$30 = 10 \times 3 \quad (1)$$

$$31 = 30 + 1 \quad (2)$$

$$61 = 30 + 31 \quad (4)$$

$$5 = 3 + 2 \quad (3)$$
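The effect of violating the RAW dependence can be reproduced with a small sequential interpreter. This is only a sketch: the `PROGRAM` table and the `run` helper are hypothetical names, and the initial values are the ones assumed in the example ($R3=10, R5=3$).

```python
import operator

# The four-instruction example; a source is a register name or an integer literal.
PROGRAM = {
    1: ("R3", operator.mul, "R3", "R5"),
    2: ("R4", operator.add, "R3", 1),
    3: ("R3", operator.add, "R5", 2),
    4: ("R7", operator.add, "R3", "R4"),
}


def run(order, initial):
    """Execute the instructions one at a time in the given order."""
    regs = dict(initial)

    def read(src):
        # Register names are looked up; integer literals are used directly.
        return regs[src] if isinstance(src, str) else src

    for number in order:
        dst, op, a, b = PROGRAM[number]
        regs[dst] = op(read(a), read(b))
    return regs


if __name__ == "__main__":
    initial = {"R3": 10, "R5": 3}
    in_order = run([1, 2, 3, 4], initial)
    swapped = run([1, 2, 4, 3], initial)  # (4) runs before (3): RAW violated
    print("program order:  R7 =", in_order["R7"])
    print("(4) before (3): R7 =", swapped["R7"])
```

The two runs end with different values in R7, because in the second run instruction (4) reads R3 before instruction (3) has produced it.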



# Output dependence



- Insn i and j write the same register, **WAW** (write after write)
- Program order must be preserved to ensure that the value finally written corresponds to instruction j.

$$R3 = R3 \times R5 \quad (1)$$

$$R4 = R3 + 1 \quad (2)$$

$$R3 = R5 + 2 \quad (3)$$

$$R7 = R3 + R4 \quad (4)$$

Assume  $R3=10, R5=3$. Results written in program order:

$$30 = 10 \times 3 \quad (1)$$

$$31 = 30 + 1 \quad (2)$$

$$5 = 3 + 2 \quad (3)$$

$$36 = 5 + 31 \quad (4)$$

If the write of (3) takes effect before the write of (1), which violates the WAW dependence on R3:

$$5 = 3 + 2 \quad (3)$$

$$30 = 10 \times 3 \quad (1)$$

$$31 = 30 + 1 \quad (2)$$

$$61 = 30 + 31 \quad (4)$$



# Antidependence



- Insn i reads a register that insn j writes, **WAR** (write after read)
- Program order must be preserved to ensure that i reads the correct value.

$$R3 = R3 \times R5 \quad (1)$$

$$R4 = R3 + 1 \quad (2)$$

$$R3 = R5 + 2 \quad (3)$$

$$R7 = R3 + R4 \quad (4)$$

Assume  $R3=10, R5=3$. Executed in program order:

$$30 = 10 \times 3 \quad (1)$$

$$31 = 30 + 1 \quad (2)$$

$$5 = 3 + 2 \quad (3)$$

$$36 = 5 + 31 \quad (4)$$

Executed with (3) before (2), which violates the WAR dependence on R3:

$$30 = 10 \times 3 \quad (1)$$

$$5 = 3 + 2 \quad (3)$$

$$6 = 5 + 1 \quad (2)$$

$$11 = 5 + 6 \quad (4)$$
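The three kinds of dependence (RAW, WAW, WAR) can be found mechanically by tracking, for each register, its most recent writer and the instructions that have read it since. The following is a minimal sketch applied to the four exercise streams and to the four-instruction example; the `(destination, sources)` representation and the name `find_dependences` are assumptions made for illustration, and immediate operands are simply omitted.

```python
def find_dependences(stream):
    """Return a list of (kind, i, j, register) with 1-based instruction numbers."""
    deps = []
    last_writer = {}  # register -> number of the most recent instruction that wrote it
    readers = {}      # register -> instructions that read it since that write
    for j, (dst, srcs) in enumerate(stream, start=1):
        unique_srcs = list(dict.fromkeys(srcs))
        for src in unique_srcs:
            if src in last_writer:
                deps.append(("RAW", last_writer[src], j, src))
        for i in readers.get(dst, []):
            deps.append(("WAR", i, j, dst))
        if dst in last_writer:
            deps.append(("WAW", last_writer[dst], j, dst))
        for src in unique_srcs:
            readers.setdefault(src, []).append(j)
        last_writer[dst] = j
        readers[dst] = []  # nobody has read the new value yet
    return deps


STREAMS = {
    "stream 1": [("R3", ["R2"]), ("R5", ["R4"]), ("R7", ["R6"])],
    "stream 2": [("R3", ["R2"]), ("R5", ["R4"]), ("R7", ["R3"])],
    "stream 3": [("R3", ["R2"]), ("R3", ["R4"]), ("R7", ["R6"])],
    "stream 4": [("R3", ["R2"]), ("R5", ["R4"]), ("R4", ["R6"])],
    "example": [("R3", ["R3", "R5"]), ("R4", ["R3"]),
                ("R3", ["R5"]), ("R7", ["R3", "R4"])],
}

if __name__ == "__main__":
    for name, stream in STREAMS.items():
        print(name, find_dependences(stream))
```

An instruction that reads and writes the same register, such as (1) in the example, is not reported as depending on itself.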



# Data dependence and renaming

- True data dependence (RAW)
- Name dependences
  - Output dependence (WAW)
  - Antidependence (WAR)

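Original instruction stream:

$$\begin{array}{ll} R3 = R3 \times R5 & (1) \\ R4 = R3 + 1 & (2) \\ R3 = R5 + 2 & (3) \\ R7 = R3 + R4 & (4) \end{array}$$

After renaming the destination of (3) from R3 to R8, which removes the name dependences (WAW and WAR) on R3:

$$\begin{array}{ll} R3 = R3 \times R5 & (1) \\ R4 = R3 + 1 & (2) \\ R8 = R5 + 2 & (3) \\ R7 = R8 + R4 & (4) \end{array}$$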
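One way to see what renaming buys is to schedule both versions of the stream as early as their dependences allow. The sketch below is a simplified model, not the lecture's method: it assumes unit latency, unlimited functional units, and that every dependence (RAW, WAW, or WAR) pushes the later instruction to a later cycle. The pairwise check is conservative, and the names `schedule` and `ilp` are hypothetical.

```python
def schedule(stream):
    """Earliest issue cycle of each instruction under the assumptions above."""
    cycles = []
    for j, (dst_j, srcs_j) in enumerate(stream):
        earliest = 1
        for i in range(j):
            dst_i, srcs_i = stream[i]
            depends = (dst_i in srcs_j      # RAW: j reads what i writes
                       or dst_i == dst_j    # WAW: both write the same register
                       or dst_j in srcs_i)  # WAR: j overwrites what i reads
            if depends:
                earliest = max(earliest, cycles[i] + 1)
        cycles.append(earliest)
    return cycles


def ilp(stream):
    """Instructions divided by the number of cycles the schedule needs."""
    return len(stream) / max(schedule(stream))


ORIGINAL = [("R3", ["R3", "R5"]), ("R4", ["R3"]),
            ("R3", ["R5"]), ("R7", ["R3", "R4"])]
RENAMED = [("R3", ["R3", "R5"]), ("R4", ["R3"]),
           ("R8", ["R5"]), ("R7", ["R8", "R4"])]

if __name__ == "__main__":
    for name, stream in (("original", ORIGINAL), ("renamed", RENAMED)):
        print(f"{name}: cycles = {schedule(stream)}, ILP = {ilp(stream):.2f}")
```

In this model the renamed stream lets instruction (3) issue in the first cycle, because only the true dependences remain.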



# Hardware register renaming

- Logical registers (architectural registers) are the ones defined by the ISA
  - \$0, \$1, ... \$31
- Physical registers
  - Assume plenty of registers are available: p0, p1, p2, ...
- A processor dynamically renames (converts) each logical register to a unique physical register

Typical instruction pipeline of scalar processor



Typical instruction pipeline of high-performance superscalar processor



# Exercise: register renaming

- Rename the following instruction stream using physical registers of p9, p10, p11, and p12

I0: sub \$5,\$1,\$2

I1: add \$9,\$5,\$4

I2: or \$5,\$5,\$2

I3: and \$2,\$9,\$1



# Example behavior of register renaming (1/4)

- Renaming the first instruction I0

Cycle 1

I0: sub \$5,\$1,\$2  
I1: add \$9,\$5,\$4  
I2: or \$5,\$5,\$2  
I3: and \$2,\$9,\$1



dst = \$5  
src1 = \$1  
src2 = \$2

Register map table

| Logical register | Physical register |
|----|--------|
| 0  | 0      |
| 1  | 1      |
| 2  | 2      |
| 3  | 3      |
| 4  | 4      |
| 5  | 5 -> 9 |
| 6  | 6      |
| 7  | 7      |
| 8  | 8      |
| 9  |        |
| 10 |        |
| 31 |        |

dst = p9  
src1 = p1  
src2 = p2

I0: sub p9,p1,p2

# Example behavior of register renaming (2/4)

- Renaming the second instruction I1

Cycle 2

I0: sub \$5,\$1,\$2  
I1: add \$9,\$5,\$4  
I2: or \$5,\$5,\$2  
I3: and \$2,\$9,\$1

Free tag buffer



dst = \$9  
src1 = \$5  
src2 = \$4

Register map table

| Logical register | Physical register |
|----|------|
| 0  | 0    |
| 1  | 1    |
| 2  | 2    |
| 3  | 3    |
| 4  | 4    |
| 5  | 9    |
| 6  | 6    |
| 7  | 7    |
| 8  | 8    |
| 9  | ->10 |
| 10 |      |
| 31 |      |

dst = p10  
src1 = p9  
src2 = p4

I0: sub p9,p1,p2  
I1: add p10,p9,p4



# Example behavior of register renaming (3/4)

- Renaming instruction I2

Cycle 3

I0: sub \$5,\$1,\$2  
I1: add \$9,\$5,\$4  
**I2: or \$5,\$5,\$2**  
I3: and \$2,\$9,\$1



dst = \$5  
src1 = \$5  
src2 = \$2

Register map table

| Logical register | Physical register |
|----|---------|
| 0  | 0       |
| 1  | 1       |
| 2  | 2       |
| 3  | 3       |
| 4  | 4       |
| 5  | 9 -> 11 |
| 6  | 6       |
| 7  | 7       |
| 8  | 8       |
| 9  | 10      |
| 10 |         |
| 31 |         |

dst = p11  
src1 = p9  
src2 = p2

I0: sub p9,p1,p2  
I1: add p10,p9,p4  
**I2: or p11,p9,p2**



# Example behavior of register renaming (4/4)

- Renaming instruction I3

Cycle 4

I0: sub \$5,\$1,\$2  
I1: add \$9,\$5,\$4  
I2: or \$5,\$5,\$2  
**I3: and \$2,\$9,\$1**



dst = \$2  
src1 = \$9  
src2 = \$1

Register map table



dst = p12  
src1 = p10  
src2 = p1

I0: sub p9,p1,p2  
I1: add p10,p9,p4  
I2: or p11,p9,p2  
**I3: and p12,p10,p1**
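The four renaming steps above can be sketched as a loop over a register map table and a free tag buffer. This is a minimal model under stated assumptions: logical registers \$0 to \$8 initially map to p0 to p8 as in the tables, the free tag buffer holds p9 to p12, and the function name `rename` and the tuple encoding of instructions are hypothetical. A real processor would stall when the free tag buffer is empty; here `popleft` would raise `IndexError` instead.

```python
from collections import deque

# Each instruction is (opcode, dst, src1, src2) using logical register numbers.
STREAM = [("sub", 5, 1, 2), ("add", 9, 5, 4), ("or", 5, 5, 2), ("and", 2, 9, 1)]


def rename(stream, free_tags=(9, 10, 11, 12)):
    """Rename one instruction per cycle; returns the renamed instructions."""
    # Assumed initial state: logical $i -> physical p<i> for i = 0..8.
    map_table = {i: i for i in range(9)}
    free = deque(free_tags)
    renamed = []
    for op, dst, src1, src2 in stream:
        # Sources are looked up before the destination mapping is updated,
        # so "or $5,$5,$2" reads the old mapping of $5.
        p_src1, p_src2 = map_table[src1], map_table[src2]
        # The destination gets a fresh tag from the free tag buffer.
        p_dst = free.popleft()
        map_table[dst] = p_dst
        renamed.append(f"{op} p{p_dst},p{p_src1},p{p_src2}")
    return renamed


if __name__ == "__main__":
    for line in rename(STREAM):
        print(line)
```

Because every destination receives a new physical register, the WAW dependence between I0 and I2 and the WAR dependence between I0 and I3 on \$2 disappear, while the RAW chains through p9 and p10 are preserved.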



# Renaming two instructions per cycle for superscalar

- Renaming instruction I0 and I1

Cycle 1

I0: sub \$5,\$1,\$2  
I1: add \$9,\$5,\$4  
I2: or \$5,\$5,\$2  
I3: and \$2,\$9,\$1

Free tag buffer



dst = \$5  
src1 = \$1  
src2 = \$2

dst = \$9  
src1 = \$5  
src2 = \$4

Register map table

| Logical register | Physical register |
|----|--------|
| 0  | 0      |
| 1  | 1      |
| 2  | 2      |
| 3  | 3      |
| 4  | 4      |
| 5  | 5 -> 9 |
| 6  | 6      |
| 7  | 7      |
| 8  | 8      |
| 9  | -> 10  |
| 10 |        |
| 31 |        |

dst = p9  
src1 = p1  
src2 = p2

dst = p10  
src1 = p5  
src2 = p4

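I0: sub p9,p1,p2  
I1: add p10,p5,p4 (Wrong: src1 must be p9, the new destination of I0, but the map table still returns the old mapping p5 because I0 and I1 are renamed in the same cycle)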



# Renaming two instructions per cycle for superscalar

- Renaming instruction I0 and I1

Cycle 1

I0: sub \$5,\$1,\$2  
I1: add \$9,\$5,\$4  
I2: or \$5,\$5,\$2  
I3: and \$2,\$9,\$1

Free tag buffer



I0: A\_dst = \$5  
A\_src1 = \$1  
A\_src2 = \$2

I1: B\_dst = \$9  
B\_src1 = \$5  
B\_src2 = \$4

Register map table



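A\_dst = p9  
A\_src1 = p1  
A\_src2 = p2

B\_dst = p10  
B\_src1 = p9  
B\_src2 = p4

If B\_src1 == A\_dst (compared as logical registers), use the tag that the free tag buffer allocated to A\_dst instead of the map table output. The same check applies to B\_src2.

I0: sub p9,p1,p2  
I1: add p10,p9,p4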
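The dependency check for renaming two instructions in one cycle can be sketched as follows. This is an illustrative model, not a description of any particular hardware: the function name `rename_pair`, the tuple encoding, and the initial map table (\$0 to \$8 mapped to p0 to p8) are assumptions. The comparison between B's sources and A's destination plays the role of the bypass described above.

```python
from collections import deque

I0, I1 = ("sub", 5, 1, 2), ("add", 9, 5, 4)
I2, I3 = ("or", 5, 5, 2), ("and", 2, 9, 1)


def rename_pair(a, b, map_table, free):
    """Rename A and B in the same cycle; each is (opcode, dst, src1, src2)."""
    a_op, a_dst, a_s1, a_s2 = a
    b_op, b_dst, b_s1, b_s2 = b
    # A is the older instruction, so it always uses the map table output.
    pa_s1, pa_s2 = map_table[a_s1], map_table[a_s2]
    # Two fresh tags are taken from the free tag buffer in program order.
    pa_dst, pb_dst = free.popleft(), free.popleft()
    # Dependency check: if B reads A's destination, use A's new tag
    # instead of the stale mapping still stored in the map table.
    pb_s1 = pa_dst if b_s1 == a_dst else map_table[b_s1]
    pb_s2 = pa_dst if b_s2 == a_dst else map_table[b_s2]
    # Update in program order so that B's mapping wins if both write one register.
    map_table[a_dst] = pa_dst
    map_table[b_dst] = pb_dst
    return (f"{a_op} p{pa_dst},p{pa_s1},p{pa_s2}",
            f"{b_op} p{pb_dst},p{pb_s1},p{pb_s2}")


if __name__ == "__main__":
    table = {i: i for i in range(9)}
    tags = deque([9, 10, 11, 12])
    print(rename_pair(I0, I1, table, tags))  # cycle 1
    print(rename_pair(I2, I3, table, tags))  # cycle 2
```

The number of comparisons grows with the square of the rename width, since every younger instruction in a group must be checked against every older one.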



# Pollack's Rule

- Pollack's Rule states that microprocessor "performance increase due to microarchitecture advances is roughly proportional to the square root of the increase in complexity". Complexity in this context means processor logic, i.e. its area.

Source: Wikipedia
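As a rough numerical illustration of the rule, the sketch below treats the proportionality as an equality, which is an assumption made only for the example. The function name `pollack_speedup` is hypothetical.

```python
import math


def pollack_speedup(complexity_ratio: float) -> float:
    """Estimated performance ratio for a given increase in complexity (area)."""
    return math.sqrt(complexity_ratio)


if __name__ == "__main__":
    for ratio in (2, 4, 16):
        print(f"{ratio}x complexity -> about {pollack_speedup(ratio):.2f}x performance")
```

Under this reading, doubling the logic area yields roughly a 1.4x performance gain, and a 2x gain requires about four times the area.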

