Fiscal Year 2024

Ver. 2024-12-18a

Course number: CSC.T433 School of Computing, Graduate major in Computer Science

# Advanced Computer Architecture

# 4. Pipelining

www.arch.cs.titech.ac.jp/lecture/ACA/ Room No. W8E-308, Lecture (Face-to-face) Mon 13:30-15:10, Thr 13:30-15:10

Kenji Kise, Department of Computer Science kise \_at\_ c.titech.ac.jp

## Single-cycle implementation of processors

 Single-cycle implementation also called single clock cycle implementation is the implementation in which an instruction is executed in one clock cycle. While easy to understand, it is too slow to be practical. It is useful as a baseline for lectures.



# Critical path of rvcore1 (single-cycle version)

• It is too slow to be practical.



#### Single-cycle implementation of laundry

- (A) Ann, (B) Brian, (C) Cathy, and (D) Don each have dirty clothes to be washed, dried, folded, and put away, each taking 30 minutes.
- The cycle time (the time from the end of one load to the end of the next one) is 2 hours.
- For four loads, the sequential laundry takes 8 hours.



# Single-cycle implementation and pipelining

- When the washing of load A is finished at 6:30 p.m., another washing of load B starts.
- Pipelined laundry takes 3.5 hours just using the same hardware resources. The cycle time is 30 minutes.
- What is the latency (execution time) of each load?





#### Bucket brigade





## Clock rate is mainly determined by

- Switching speed of gates (transistors)
- The number of levels of gates
  - The maximum number of gates cascaded in series in any combinational logics.
  - In this example, the number of levels of gates is 3.
- Wiring delay and fanout



# Pipelining example: multiply-add operation (1)

- As an example of pipelining, we will see a multiply-add circuit.
- r\_b, r\_c are input registers and r\_y is output register of the circuit.
- This has two paths named path1 and path2, and path1 is the critical path to determine the maximum operating frequency.







# Pipelining example: multiply-add operation (2)

- By inserting register r\_d, the critical path can be divided into Path3 and Path4.
- As a result, the new critical path becomes Path3.
- This has the disadvantage that input b and c in the same clock cycle cannot be processed.





# Pipelining example: multiply-add operation (3)

- To overcome this drawback, we insert register r\_e.
- This realizes a pipeline with stages 1 and 2.
   A set of registers between two adjacent stages are called a pipeline register.



(a) original multiply-add circuit



#### rvcore\_2s : 2-stage pipelining processor

 The strategy is to separate the instruction fetch (IF) step and other (ID, EX, MA, WB) steps. The first stage is named IF. The other stage is named EX+.



#### rvcore\_2s : 2-stage pipelining processor



(a) rvcore\_2s: 2-stage pipelining processor



#### (b) a pipeline diagram of rvcore\_2s

#### Why do branch instructions degrade IPC?

- The branch taken / untaken (branch result) is determined in the EX+ stage of the branch.
- The conservative approach is stalling instruction fetch until the branch direction is determined.



two-stage pipelining processor executing instruction sequence with a branch

#### Why do branch instructions degrade IPC?

- Another approach is fetching the following instruction (an instruction at the next address) when a branch (bne) is fetched.
- When a branch (08 bne) is taken, the wrong instruction fetched (Oc add) is flushed.



## Verilog HDL code for rvcore\_2s



module m rvcore 2s ( //// RVCore two-stage pipelining version input wire w clk, // clock signal input wire w rst, // reset signal output wire [31:0] D ADDR, // data memory, address output wire [31:0] D OUT, // data memory, output data output wire D WE // data memory, write enable ); reg [31:0] r pc; reg [31:0] P1\_ir, P1\_pc; /\* pipeline register \*/ /\* pipeline register \*/ reg P1 v; wire w jalr, w op im, w b rslt; wire [9:0] w itype; wire [10:0] w alu c; wire [6:0] w bru c; wire [4:0] w rs1, w rs2, w rd; wire [31:0] w ir, w rrs1, w rrs2 t, w rrs2, w imm t, w imm; wire [31:0] w a rslt, w rslt, w tkn pc, w ldd; wire w miss = w b rslt & P1 v; m imem imem1 (w\_clk, r\_pc, w\_ir); m decoder decoder1 (P1 pc, P1 ir, w rd, w rs1, w rs2, w\_op\_im, w\_itype, w\_jalr, w\_alu\_c, w\_bru\_c, w\_imm\_t, w\_imm); m regfile regfile1 (w clk, w rs1, w rs2, w rrs1, w rrs2 t, w rd, P1 v, w rs1t); assign w\_rrs2 = (w\_op\_im) ? w\_imm : w\_rrs2\_t; m alu alu1 (w rrs1, w rrs2, w alu c, w a rslt); m bru bru0 (w rrs1, w rrs2, w bru c, w b rslt); assign D ADDR = w rrs1 + w imm; assign D OUT = w rrs2; assign D WE = w itype[`D S TYPE] & P1 v; m dmem dmem1 (w clk, D\_WE, D\_ADDR, D\_OUT, w\_ldd); assign w rslt = (w itype[`D LD IS]) ? w ldd : w a rslt; assign w tkn pc = ((w jalr) ? w rrs1 : P1 pc) + w imm t; always @(posedge w clk) begin r pc <= (w rst) ? `START PC :</pre> (w b rslt & P1 v) ? w tkn pc : r pc + 4; P1 ir  $\leq w$  ir; P1 pc <= r pc; P1 v <= (w rst) ? 0 : !w miss; end

#### rvcore\_3s : 3-stage pipelining processor

The strategy is to separate the instruction fetch (IF) step, instruction decode (ID) step, and other (EX, MA, WB) steps. The first stage is named IF. The second stage is named ID. The last stage is named EX+.



#### Exercise 1

Draw the main datapath of the processor rvcore\_3s and write the valid values on wires when the processor is executing these three instructions





# m\_regfile with bypassing

```
module m regfile ( //// register file
    input wire
                       w clk,
    input wire [ 4:0] rs1, rs2,
    output wire [31:0] rdata1, rdata2,
    input wire [4:0] rd,
    input wire
                       we,
    input wire [31:0] wdata
);
    reg [31:0] mem [0:31];
    integer i; initial begin for(i=0; i<32; i=i+1) mem[i]=0; end</pre>
    assign rdata1 = (rs1==0) ? 0 : mem[rs1];
    assign rdata2 = (rs2==0) ? 0 : mem[rs2];
    always @(posedge w_clk) if(rd!=0 && we) mem[rd] <= wdata;</pre>
endmodule
```

m\_regfile for rvcore1 and rvcore\_2s

```
module m regfile ( //// register file
                              input wire
                                                 w clk,
                              input wire [4:0] rs1, rs2,
                              output wire [31:0] rdata1, rdata2,
                              input wire [4:0] rd,
                              input wire
                                                 we,
                              input wire [31:0] wdata
                          );
                              reg [31:0] mem [0:31];
                              integer i; initial begin for(i=0; i<32; i=i+1) mem[i]=0; end</pre>
                              wire bp1 = (we & rs1==rd); // bypassing for rdata1
                              wire bp2 = (we & rs2==rd); // bypassing for rdata2
                              assign rdata1 = (rs1==0) ? 0 : (bp1) ? wdata : mem[rs1];
                              assign rdata2 = (rs2==0) ? 0 : (bp2) ? wdata : mem[rs2];
                              always @(posedge w clk) if(rd!=0 && we) mem[rd] <= wdata;
m_regfile for rvcore_3s
                          endmodule
```

rs1

rs2

rd

wdata

regfile

5

5

rdata1

rdata2

32

32

rs1

rs2

rd

vdata

regfile

5

32

rdata1

data2

32

32

#### Why do branch instructions degrade IPC?

- Another approach is fetching the following instructions (Oc add, 10 add) after a branch (bne) is fetched.
- When a branch (08 bne) is taken, the wrong instructions fetched (Oc add, 10 add) are flushed.



three-stage pipelining processor executing instruction sequence with a taken branch

# rvcore\_4s : 4-stage pipelining processor

 The strategy is to separate the instruction fetch (IF) step, instruction decode (ID) step, and write back(WB) step, and other (EX, MA) steps. The first stage is named IF. The second stage is named ID. The third stage is named EX+. The last stage is named WB.



#### rvcore\_4s : 4-stage pipelining processor

- Critical path
  - deley(register read) + delay(mux) + delay(mux) + delay(adder) + delay(dmem read)



#### Exercise 2

Draw the main datapath of the processor rvcore\_4s and write the valid values on wires when the processor is executing these three instructions in ID, EX+, and WB stages

0x00 addi x1, x0, 3 # x1 = 3 0x04 addi x2, x1, 4 # x2 = 3 + 4 = 7 0x08 add x5, x1, x2 # x5 = 3 + 7 = 10



#### Why do branch instructions degrade IPC?

- Another approach is fetching the following instructions (Oc add, 10 add) after a branch (bne) is fetched.
- When a branch (08 bne) is taken, the wrong instructions fetched (Oc add, 10 add) are flushed.



four-stage pipelining processor executing instruction sequence with a taken branch

#### Comparison of critical path between rvcore1 and rvcore\_4s



#### **Recommended Reading**

#### • Increasing Processor Performance by Implementing Deeper Pipelines

- Eric Sprangle , Doug Carmean (Intel Corporation)
- ISCA-2002 pp. 25-34 (2002)

This paper will show that the branch misprediction latency is the single largest contributor to performance degradation as pipelines are stretched, and therefore branch prediction and fast branch recovery will continue to increase in importance. We will also show that higher performance cores, implemented with longer pipelines for example, will put more pressure on the memory system, and therefore require larger on-chip caches. Finally, we will show that in the same process technology, designing deeper pipelines can increase the processor frequency by 100%, which, when combined with larger on-chip caches can yield performance improvements of 35% to 90% over a Pentium® 4 like processor.

| Ba     | asi | c Pe  | ntium  | III P   | roces  | sor N  | lispre | dictic  | n Pip    | eline |
|--------|-----|-------|--------|---------|--------|--------|--------|---------|----------|-------|
| 1      |     | 2     | 3      | 4       | 5      | 6      | 7      | 8       | 9        | 10    |
| E at a |     | Fetch | Decode | Deserte | Decede | Banama | DOD D4 | Rdy/Sch | Dispatch | Exec  |
| Fetc   |     |       |        |         |        |        |        |         |          |       |
|        |     |       | ntium  | • 4 Pr  | ocess  | sor M  | ispreo | dictio  | n Pipe   | eline |
|        |     |       |        | • 4 Pr  | ocess  | sor M  | ispreo | dictio  | n Pipe   |       |

