Fiscal Year 2023

Ver. 2024-01-18a

Course number: CSC.T433 School of Computing, Graduate major in Computer Science

# Advanced Computer Architecture

## 9. Instruction Level Parallelism: Out-of-order Execution and Multithreading

www.arch.cs.titech.ac.jp/lecture/ACA/ Room No. W834, Lecture (Face-to-face) Mon 13:30-15:10, Thu 13:30-15:10

Kenji Kise, Department of Computer Science kise \_at\_ c.titech.ac.jp

## Exploiting Instruction Level parallelism (ILP)

- A superscalar processor has to handle the following flows efficiently to exploit ILP
  - Control flow (control dependence)
    - To execute *n* instructions per clock cycle, the processor has to fetch at least *n* instructions per cycle.
    - The main obstacle is branch instructions (e.g., BNE)
    - Branch prediction
    - Another obstacle is the instruction cache
  - Register data flow (data dependence)
    - Out-of-order execution
      - Register renaming
      - Dynamic scheduling
  - Memory data flow
    - Out-of-order execution
    - Another obstacle is data cache



#### Instruction pipeline of OoO execution processor

- Allocating instructions to the instruction window is called dispatch.
- Issue (or fire) wakes up instructions, and their execution begins.
- In the commit stage, the computed values are written to the ROB (reorder buffer).
- The last stage is called retire or graduate. Completed consecutive instructions can be retired. Each result is written back to the register file (the architectural register file of 32 registers) using a logical register number from x0 to x31.



## The key idea for OoO execution (last lecture)

- In-order front end, OoO execution core, and in-order retirement, using the instruction window and reorder buffer (ROB)



## Register dataflow

- In-flight instructions are those currently being processed in the processor



Data flow graph



#### Case 1: Register dataflow from a much earlier insn

- One source operand of insn I2 comes from a retired instruction Ia.
- Because Ia retired long ago, its physical destination register has been freed. The source register x3 cannot be renamed at the renaming stage for I2, so it keeps the logical register tag x3.
- Where does the operand x3 of I2 come from?

Ia: add x3,x0,x0
I1: sub p9,x1,x2
I2: add p10,p9,x3
I3: or p11,x4,x5
I4: and p12,p10,p11



#### Case 1: Register dataflow from RF

- Because Ia has retired, its result has already been written back to the register file (RF), so I2 reads the operand x3 from the RF using the logical register number.



## Example behavior of register renaming and valid bit

- The processor remembers the set of logical registers that are currently renamed.
- If x1 and x2 are not renamed by any in-flight insn, the source operands keep the logical tags x1 and x2 instead of physical tags such as p1 and p2.
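The renaming behavior can be illustrated with a minimal Python sketch. It is not the slide's hardware design: the function name `rename`, the dictionary `rename_map` (whose membership test plays the role of the valid bit), and the logical destination registers x6 to x9 of the input program are assumptions made for illustration. The program is chosen so that the output matches the renamed sequence I1 to I4 shown in Case 1.

```python
def rename(program, first_tag=9):
    """Rename each destination to a fresh physical tag (p9, p10, ...).

    A source is replaced by a physical tag only if an in-flight instruction
    has renamed it; otherwise it keeps its logical name (x1, x2, ...).
    """
    rename_map = {}      # logical register -> physical tag (presence = "renamed")
    next_tag = first_tag
    renamed = []
    for op, dst, src1, src2 in program:
        # Look up sources before renaming the destination,
        # so that an instruction such as add x1,x1,x2 is handled correctly.
        s1 = rename_map.get(src1, src1)
        s2 = rename_map.get(src2, src2)
        tag = f"p{next_tag}"
        next_tag += 1
        rename_map[dst] = tag
        renamed.append((op, tag, s1, s2))
    return renamed


# Hypothetical original program (logical destinations are assumed).
program = [
    ("sub", "x6", "x1", "x2"),   # I1
    ("add", "x7", "x6", "x3"),   # I2
    ("or",  "x8", "x4", "x5"),   # I3
    ("and", "x9", "x7", "x8"),   # I4
]

for insn in rename(program):
    print(insn)
```

Because x1 to x5 are never written by an in-flight instruction here, they keep their logical tags, while x6 to x9 become p9 to p12.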



#### Case 2: Register dataflow

- Assume that one source operand, p10, of insn I5 comes from I2, which has not retired. The operand was generated a few clock cycles (sometimes tens of cycles) earlier.
- Because I2 has not retired, the RF does not have the operand. Because I2 has committed, the operand is stored in the ROB.
- Where does the operand of I5 come from?





#### Case 2: Register dataflow from ROB

- Because I2 has committed but has not retired, I5 reads the operand p10 from the matching ROB entry.





#### Case 3: Register dataflow

- Assume that the other source operand, p12, of insn I5 comes from I4, which has not committed. The operand was generated in the previous clock cycle.
- Because I4 has not retired, the RF does not have the operand. Because I4 has not committed, the ROB does not have the operand either.
- Where does the operand of I5 come from?





#### Case 3: Register dataflow from ALUs

- Because I4 has neither committed nor retired, the operand p12 is in neither the ROB nor the RF. I5 receives it directly from the ALU that computed it in the previous clock cycle.
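Cases 1 to 3 can be summarized as a lookup over the three possible operand sources. The following Python sketch is only an illustration: the function `operand_source`, the dictionary `rob_data_valid`, and the set `bypass` are hypothetical names, and the fourth outcome ("wait") is an assumption about operands that have not been produced yet.

```python
def operand_source(tag, rob_data_valid, bypass):
    """Return where the source operand named by `tag` is read from.

    tag            : "xN" (logical, not renamed) or "pN" (physical, renamed)
    rob_data_valid : dict, physical tag -> True if the value is already in the ROB
    bypass         : set of physical tags produced by the ALUs in the previous cycle
    """
    if tag.startswith("x"):
        return "RF"    # Case 1: the producer has retired, value is in the register file
    if tag in bypass:
        return "ALU"   # Case 3: produced in the previous cycle, not yet committed
    if rob_data_valid.get(tag, False):
        return "ROB"   # Case 2: committed but not retired
    return "wait"      # assumed: not produced yet, the consumer stays in the window


# State assumed for illustration: I1..I3 have committed, I4 finished last cycle.
rob_data_valid = {"p9": True, "p10": True, "p11": True, "p12": False}
bypass = {"p12"}

for tag in ("x3", "p10", "p12"):
    print(tag, "->", operand_source(tag, rob_data_valid, bypass))
```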





## Reorder buffer (ROB)


- Each ROB entry has the following fields
  - entry valid bit, data valid bit, data, target register number, etc.
- The ROB provides a large set of physical registers for renaming
  - in fact, the physical register number is the ROB entry number
- The value of a physical register is read from the matching ROB entry
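A minimal Python sketch of this structure follows, using the terminology of these slides (commit writes a value into the ROB, retire writes it to the register file). The class and method names are assumptions, and the model is simplified: membership in the queue stands in for the entry valid bit, and the number of retirements per cycle is not limited.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    dst: str                  # target (logical) register number, e.g. "x5"
    data_valid: bool = False  # set when the computed value is written to the ROB
    data: int = 0


class ReorderBuffer:
    def __init__(self):
        # Oldest instruction at the left; being in the deque = entry valid.
        self.entries = deque()

    def dispatch(self, dst):
        """Allocate an entry in program order."""
        entry = RobEntry(dst)
        self.entries.append(entry)
        return entry

    def commit(self, entry, value):
        """Write a computed value into its ROB entry (possibly out of order)."""
        entry.data = value
        entry.data_valid = True

    def retire(self, regfile):
        """Retire completed consecutive instructions from the head, in order."""
        retired = 0
        while self.entries and self.entries[0].data_valid:
            entry = self.entries.popleft()
            regfile[entry.dst] = entry.data
            retired += 1
        return retired


rob, regfile = ReorderBuffer(), {}
a, b, c = rob.dispatch("x5"), rob.dispatch("x6"), rob.dispatch("x7")
rob.commit(b, 20)
rob.commit(c, 30)
print(rob.retire(regfile))  # the head (a) is not complete, so nothing retires
rob.commit(a, 10)
print(rob.retire(regfile))  # now all three consecutive entries retire
```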

(Figure: example contents of the ROB and the RF at cycle 8.)

### Datapath of OoO execution processor




## Instruction fetch unit of 2-way super-scalar

- High-bandwidth instruction delivery using prediction and speculation





### Renaming two instructions per cycle for superscalar

- Renaming instructions I0 and I1



## Datapath of OoO execution processor (partially)






## Aside: What is a window?

A window is a space in the wall of a building or in the side of a vehicle, which has glass in it so that light can come in and you can see out. (Collins)





#### Reservation station (RS)

- To simplify the wakeup and select logic at the issue stage, each functional unit (ALU) has its own instruction window. An entry for one instruction is called a reservation station (RS).
- Each reservation station has
  - entry valid bit, <u>src1 tag, src1 data, src1 ready</u>, <u>src2 tag, src2 data, src2 ready</u>, <u>destination physical register number (dst)</u>, operation, ...
  - The computed data (outcome), with its dst as the tag, is broadcast to all RSs.
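The tag broadcast can be sketched in Python as follows. This is an illustrative model, not the slide's circuit: the class `ReservationStation`, its method names, and the helper `broadcast` are assumptions, and the entry valid bit and the select logic are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    op: str
    dst: str                         # destination physical register, used as the tag
    src1_tag: str
    src2_tag: str
    src1_data: Optional[int] = None
    src2_data: Optional[int] = None
    src1_ready: bool = False
    src2_ready: bool = False

    def capture(self, tag, data):
        """Wakeup: compare the broadcast tag with both source tags."""
        if not self.src1_ready and self.src1_tag == tag:
            self.src1_data, self.src1_ready = data, True
        if not self.src2_ready and self.src2_tag == tag:
            self.src2_data, self.src2_ready = data, True

    def ready(self):
        """The instruction can be issued once both sources are ready."""
        return self.src1_ready and self.src2_ready


def broadcast(stations, tag, data):
    """Send a computed result, tagged with its dst, to every RS."""
    for rs in stations:
        rs.capture(tag, data)


# I4: and p12,p10,p11 waits for the results tagged p10 and p11.
i4 = ReservationStation("and", "p12", "p10", "p11")
broadcast([i4], "p10", 7)
print(i4.ready())
broadcast([i4], "p11", 3)
print(i4.ready(), i4.src1_data, i4.src2_data)
```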







## Example behavior of reservation stations



Dispatch at most two instructions: one to A or B, and the other to C or D.










#### Instruction Level Parallelism (ILP)



#### Memory dataflow and branches

- The update of a data cache cannot be undone easily, so the cache is updated in order at the retire stage by using a store queue.
- Because of ambiguous memory dependences, load and store instructions can be executed in order.
  - About 30% (or less) of executed instructions are loads and stores.
  - Even if they are executed in order, an IPC of 3 can be achieved.
- Branch instructions can be executed in order.
  - About 20% (or less) of executed instructions are jump and branch instructions.
  - Out-of-order branch execution and aggressive misprediction recovery may cause false recovery (recovery triggered by a branch on the wrong control path).
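The IPC figure can be checked with a rough bound. The sketch below assumes (this is not stated in the slides) that the in-order path handles one load or store per cycle, and likewise one branch per cycle. The function name `ipc_upper_bound` is hypothetical.

```python
def ipc_upper_bound(fraction, per_cycle=1.0):
    """Upper bound on overall IPC when instructions of one class, making up
    `fraction` of all executed instructions, are limited to `per_cycle`
    instructions per clock cycle."""
    return per_cycle / fraction


print(round(ipc_upper_bound(0.30), 2))  # loads and stores
print(round(ipc_upper_bound(0.20), 2))  # jumps and branches
```

Under these assumptions, a 30% share of loads and stores bounds the IPC at about 3.3, which is consistent with the IPC of 3 mentioned above, and a 20% share of branches bounds it at 5.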

### Datapath of OoO execution processor



#### Pollack's Rule

 Pollack's Rule states that microprocessor "performance increase due to microarchitecture advances is roughly proportional to the square root of the increase in complexity". Complexity in this context means processor logic, i.e. its area.



(Source: Wikipedia)
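As a numeric illustration of the rule, the Python sketch below compares one large core with several smaller cores that use the same total area. It assumes that performance is exactly the square root of the area and that the workload is perfectly parallel. Both are simplifications, and the function names are hypothetical.

```python
import math


def relative_performance(relative_area):
    """Pollack's Rule: performance grows roughly as sqrt(complexity)."""
    return math.sqrt(relative_area)


def many_core_throughput(total_area, n_cores):
    """Aggregate throughput of n equal cores sharing total_area,
    assuming a perfectly parallel workload (an idealization)."""
    return n_cores * relative_performance(total_area / n_cores)


print(relative_performance(2.0))       # doubling the area of a single core
print(many_core_throughput(4.0, 1))    # one core with 4 units of area
print(many_core_throughput(4.0, 4))    # four cores with 1 unit of area each
```

Under these assumptions, doubling the area of a single core gives only about 1.41 times the performance, while splitting the same area into several smaller cores gives higher aggregate throughput for parallel workloads.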

#### From multi-core era to many-core era



Figure 1. Relative sizes of the cores used in the study (the figure includes EV6 cores).

Source: "Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction," MICRO-36

#### From multi-core era to many-core era



Figure 1: Current and expected eras of Intel® processor architectures

Source: "Platform 2015: Intel® Processor and Platform Evolution for the Next Decade," 2005

## Multithreading (1/2)

- During a branch misprediction recovery or an access to the main memory caused by a cache miss, the ALUs have no work to do and have to be idle.
- Executing multiple independent threads (programs) mitigates this overhead.
- Processors that do so are called coarse-grained and fine-grained multithreaded processors. They have multiple architectural states.
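The benefit can be illustrated with a toy model of fine-grained multithreading. The Python sketch below is not a model of any real processor: it assumes a single-issue pipeline, instruction streams made only of "alu" and "miss" entries, a fixed miss latency of 4 cycles, and round-robin selection among ready threads. All names are hypothetical.

```python
def run(threads, miss_latency=4):
    """Issue one instruction per cycle from the next ready thread (round-robin).

    Returns (busy_cycles, total_cycles). After a thread issues a "miss",
    it cannot issue again for `miss_latency` cycles.
    """
    n = len(threads)
    pc = [0] * n          # per-thread program counter (one architectural state each)
    ready_at = [0] * n    # first cycle in which each thread may issue again
    cycle = busy = 0
    last = -1             # thread that issued most recently
    while any(pc[t] < len(threads[t]) for t in range(n)):
        for i in range(n):
            t = (last + 1 + i) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                if threads[t][pc[t]] == "miss":
                    ready_at[t] = cycle + 1 + miss_latency
                pc[t] += 1
                busy += 1
                last = t
                break
        cycle += 1
    return busy, cycle


stream = ["alu", "miss", "alu", "alu"]
for threads in ([stream], [stream, stream]):
    busy, total = run(threads)
    print(len(threads), "thread(s):", busy, "busy cycles out of", total)
```

In this toy model, a second thread fills some of the cycles that the first thread spends waiting for its miss, so the fraction of busy cycles increases.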



## Multithreading (2/2)

- Simultaneous multithreading (SMT) can improve hardware resource usage.





Exercise 1

Cycle 0 dispatch I1, I2




