Academic year 2024 (Reiwa 6) edition

Ver. 2024-10-15a

Course number: CSC.T363

Computer Architecture

4. Caches: Direct-Mapped

www.arch.cs.titech.ac.jp/lecture/CA/ Tue 13:30-15:10, 15:25-17:05 Fri 13:30-15:10

CSC.T363 Computer Architecture, Department of Computer Science, TOKYO TECH

Kenji Kise, Department of Computer Science, Kise \_at\_ c.titech.ac.jp

# A Typical Memory Hierarchy

By taking advantage of the principle of locality:

Present the user with as much memory as is available in the cheapest technology



#### **RISC-V** Reference Card

[Figure: Free & Open RISC-V Reference Card, covering the base integer instructions (RV32I, RV64I, and RV128I), the RV privileged instructions, and the optional compressed (16-bit) instruction extension RVC]



#### https://www.arch.cs.titech.ac.jp/lecture/CA/RISCVGreenCard.pdf

#### **RISC-V** instruction set simulator

- venus is a RISC-V instruction set simulator built for education.
  - https://venus.kvakil.me/
  - https://github.com/kvakil/venus

[Figure: venus screenshot with Editor and Simulator tabs. The Simulator tab has Prev, Reset, Dump, Run, and Step controls, Registers and Memory panes, and a console output area.]

| Original Code  | Basic Code   | Machine Code |
|----------------|--------------|--------------|
| addi x1, x0, 1 | addi x1 x0 1 | 0x00100093   |
| addi x2, x0, 2 | addi x2 x0 2 | 0x00200113   |

| Sample sequence   | Code |
|-------------------|------|
| sample sequence 1 | addi x1, x0, 1<br>addi x2, x0, 10<br>add x3, x1, x2 |
| sample sequence 2 | addi x1, x0, 1<br>addi x2, x0, 10<br>L: addi x1, x1, 1<br>bne x1, x2, L |
| sample sequence 3 | lui x1, 0x123<br>ori x1, x1, 0x456<br>sw x1, 32(x0)<br>lw x2, 32(x0)<br>lb x3, 32(x0)<br>lb x4, 33(x0)<br>lb x5, 34(x0) |

## little-endian, big-endian

In a little-endian configuration, multibyte stores write the least-significant register byte at the lowest memory byte address, followed by the other register bytes in ascending order of their significance. Loads similarly transfer the contents of the lesser memory byte addresses to the less-significant register bytes.

In a big-endian configuration, multibyte stores write the most-significant register byte at the lowest memory byte address, followed by the other register bytes in descending order of their significance. Loads similarly transfer the contents of the greater memory byte addresses to the less-significant register bytes.
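The two byte orders can be compared with a small sketch. It assumes the 32-bit value `0x00123456`, which is what `lui x1, 0x123` followed by `ori x1, x1, 0x456` leaves in `x1`. The helper `load_byte` is a hypothetical stand-in for the sign-extending `lb` instruction, not part of any simulator.

```python
# Sketch: how a 4-byte store lays out a register value in memory.
value = 0x00123456  # assumed register contents (lui 0x123, then ori 0x456)

# Index 0 of each bytes object is the lowest memory byte address.
little = value.to_bytes(4, byteorder="little")
big = value.to_bytes(4, byteorder="big")


def load_byte(memory: bytes, offset: int) -> int:
    """Return the sign-extended byte at `offset`, similar in spirit to lb."""
    return int.from_bytes(memory[offset:offset + 1], byteorder="little", signed=True)


print("little-endian:", [hex(b) for b in little])
print("big-endian:   ", [hex(b) for b in big])
print("byte loads at offsets 0..2 (little-endian):",
      [hex(load_byte(little, i)) for i in range(3)])
```

Under the little-endian layout the least-significant byte `0x56` sits at the lowest address, while under the big-endian layout the most-significant byte `0x00` does.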



Pareto principle

- Vilfredo Federico Damaso Pareto
  - Italian economist (1848-1923)
- Pareto principle
  - Most of the overall total is produced by a small subset of the elements that make up the whole
  - The 80:20 rule



The Memory Hierarchy: Why Does it Work?

- Temporal Locality (Locality in Time):
  ⇒ Keep most recently accessed data items closer to the processor
- Spatial Locality (Locality in Space):
  ⇒ Move blocks consisting of contiguous words to the upper levels



#### Cache

- Two questions to answer (in hardware):
  - Q1: How do we know if a data item is in the cache?
  - Q2: If it is, how do we find it?
- Direct mapped
  - For each item of data at the lower level, there is **exactly one location** in the cache where it might be, so many items at the lower level must share the same location in the upper level
  - Address mapping:
     (block address) modulo (# of blocks in the cache)
  - First, consider block sizes of one word
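The address mapping above can be sketched in a few lines. The cache size of 4 blocks and the 16 block addresses are assumptions chosen for illustration, and `cache_index` is a hypothetical helper name.

```python
from collections import defaultdict

NUM_BLOCKS = 4  # assumed cache size in blocks (illustrative only)


def cache_index(block_address: int, num_blocks: int = NUM_BLOCKS) -> int:
    """Direct mapping: (block address) modulo (# of blocks in the cache)."""
    return block_address % num_blocks


# Group 16 lower-level block addresses by the single cache location they may occupy.
sharing = defaultdict(list)
for block_address in range(16):
    sharing[cache_index(block_address)].append(block_address)

for index in sorted(sharing):
    print(f"cache block {index}: shared by block addresses {sharing[index]}")
```

Each cache block is the only candidate location for four different block addresses here, which is why a tag is needed to tell them apart.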

# Caching: A Simple First Example



Two low-order bits define the byte in the word (32-bit word)

Q2: How do we find it?

Use **next 2 low order memory address bits** – the **index** – to determine which cache block

(block address) modulo (# of blocks in the cache)
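As a rough illustration of the field layout, the sketch below splits a byte address into tag, index, and byte offset. It assumes a 4-block cache with one-word blocks, as the 2-bit index implies. The constants and the function name `split_address` are hypothetical.

```python
BYTE_OFFSET_BITS = 2  # 4 bytes per 32-bit word
INDEX_BITS = 2        # assumed: 4 cache blocks, one word per block


def split_address(byte_address: int):
    """Return (tag, index, byte_offset) for a byte address."""
    byte_offset = byte_address & ((1 << BYTE_OFFSET_BITS) - 1)
    block_address = byte_address >> BYTE_OFFSET_BITS
    index = block_address & ((1 << INDEX_BITS) - 1)  # next 2 low-order bits
    tag = block_address >> INDEX_BITS                # remaining high-order bits
    return tag, index, byte_offset


tag, index, byte_offset = split_address(0b110110)
print(f"tag={tag:b} index={index:b} byte_offset={byte_offset:b}")

# Masking the low-order bits gives the same result as the modulo formula.
assert index == (0b110110 >> BYTE_OFFSET_BITS) % (1 << INDEX_BITS)
```

Because the number of blocks is a power of two, the modulo operation reduces to selecting bits, which is cheap in hardware.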

## Direct Mapped Cache Example

One word/block, cache size = 1K words





#### What kind of locality are we taking advantage of?

#### Example Behavior of Direct Mapped Cache

Consider the main memory word reference string (word addresses): 0 1 2 3 4 3 4 15

Start with an empty cache - all blocks initially marked as not valid



8 requests, 6 misses
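The miss count can be checked with a minimal simulation. It assumes a 4-block cache with one-word blocks (the size implied by a 2-bit index). `simulate_direct_mapped` is a hypothetical helper written for this sketch.

```python
def simulate_direct_mapped(references, num_blocks):
    """Return a list of 'hit'/'miss' outcomes for a one-word-block cache."""
    valid = [False] * num_blocks  # all blocks start out not valid
    tags = [None] * num_blocks
    outcomes = []
    for word_address in references:
        index = word_address % num_blocks
        tag = word_address // num_blocks
        if valid[index] and tags[index] == tag:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            valid[index] = True   # install the block, replacing any old one
            tags[index] = tag
    return outcomes


references = [0, 1, 2, 3, 4, 3, 4, 15]
outcomes = simulate_direct_mapped(references, num_blocks=4)
for word_address, outcome in zip(references, outcomes):
    print(f"{word_address:2d}: {outcome}")
print(len(references), "requests,", outcomes.count("miss"), "misses")
```

The only hits are the second accesses to words 3 and 4, which is temporal locality at work.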

## Another Reference String Mapping

Consider the main memory word reference string



- 8 requests, 8 misses
  - Ping-pong effect due to conflict misses: two memory locations that map into the same cache block keep evicting each other
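A small sketch can reproduce this behavior. The reference string `0 4 0 4 0 4 0 4` and the 4-block cache size are assumptions chosen because they yield 8 misses in 8 requests. They are not taken from the slide, and `count_misses` is a hypothetical helper.

```python
def count_misses(references, num_blocks):
    """Count misses in a direct-mapped cache with one-word blocks."""
    cache = {}  # index -> tag of the resident block; a missing key means not valid
    misses = 0
    for word_address in references:
        index = word_address % num_blocks
        tag = word_address // num_blocks
        if cache.get(index) != tag:
            misses += 1
            cache[index] = tag  # evict whatever was there
    return misses


ping_pong = [0, 4] * 4    # assumed string: both words map to cache block 0
no_conflict = [0, 5] * 4  # words map to different blocks, for comparison

print("conflicting string:    ", count_misses(ping_pong, num_blocks=4), "misses")
print("non-conflicting string:", count_misses(no_conflict, num_blocks=4), "misses")
```

With the conflicting string every access evicts the block the next access needs, even though three of the four cache blocks stay empty.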


#### Multiword Block Direct Mapped Cache

• Four words/block, cache size = 1K words
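To see how the block size changes the address fields, the sketch below computes the field widths for a 1K-word cache with one-word and four-word blocks. The 32-bit byte address and the function name `field_widths` are assumptions made for this example.

```python
def log2_exact(n: int) -> int:
    """Return log2(n) for a power of two, otherwise raise ValueError."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def field_widths(cache_words, words_per_block, address_bits=32, bytes_per_word=4):
    """Return the bit widths of tag, index, block offset, and byte offset."""
    num_blocks = cache_words // words_per_block
    byte_offset = log2_exact(bytes_per_word)
    block_offset = log2_exact(words_per_block)
    index = log2_exact(num_blocks)
    tag = address_bits - index - block_offset - byte_offset
    return {"tag": tag, "index": index,
            "block_offset": block_offset, "byte_offset": byte_offset}


one_word = field_widths(cache_words=1024, words_per_block=1)
four_word = field_widths(cache_words=1024, words_per_block=4)
print("one word/block:  ", one_word)
print("four words/block:", four_word)
```

Under these assumptions the tag stays at 20 bits in both cases. Two index bits simply move into the block offset, and the cache needs only 256 tags instead of 1024.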



#### Taking Advantage of Spatial Locality

Let cache block hold more than one word

Reference string (word addresses): 0 1 2 3 4 3 4 15



8 requests, 4 misses
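The effect of larger blocks can be sketched as follows. The configuration of 2 blocks with 2 words each (the same four-word capacity as a 4-block, one-word cache) is an assumption that reproduces the 4 misses above. `simulate_multiword` is a hypothetical helper.

```python
def simulate_multiword(references, num_blocks, words_per_block):
    """Return 'hit'/'miss' outcomes for a direct-mapped cache with multiword blocks."""
    cache = {}  # index -> tag of the resident block; a missing key means not valid
    outcomes = []
    for word_address in references:
        block_address = word_address // words_per_block
        index = block_address % num_blocks
        tag = block_address // num_blocks
        if cache.get(index) == tag:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            cache[index] = tag  # fetch the whole block, neighbors included
    return outcomes


references = [0, 1, 2, 3, 4, 3, 4, 15]
outcomes = simulate_multiword(references, num_blocks=2, words_per_block=2)
for word_address, outcome in zip(references, outcomes):
    print(f"{word_address:2d}: {outcome}")
print(len(references), "requests,", outcomes.count("miss"), "misses")
```

The first accesses to words 1 and 3 now hit because their neighbors 0 and 2 already brought the block in, which is spatial locality.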

## Handling Cache Hits (misses are covered next)

- Read hits (I\$ and D\$)
  - this is what we want!
- Write hits (D\$ only)
  - allow cache and memory to be inconsistent
    - write the data only into the cache block (write-back)
    - need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted
  - require the cache and memory to be consistent
    - always write the data into both the cache block and the next level in the memory hierarchy (write-through), so no dirty bit is needed
    - writes run at the speed of the next level in the memory hierarchy, which is slow; with a write buffer, the processor only has to stall when the write buffer is full
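The two write-hit policies can be contrasted with a toy model that applies a run of write hits to a single cache block and then evicts it. The `Block` class and the `write_hits` function are hypothetical names for this sketch. It counts only writes to the next level and ignores timing.

```python
from dataclasses import dataclass


@dataclass
class Block:
    data: int = 0
    dirty: bool = False  # only used by the write-back policy


def write_hits(values, policy):
    """Apply write hits to one block, then evict it.

    Returns (block data at eviction, number of writes to the next level).
    """
    block = Block()
    next_level_writes = 0
    for value in values:
        block.data = value
        if policy == "write-through":
            next_level_writes += 1  # every write also goes to the next level
        elif policy == "write-back":
            block.dirty = True      # cache and memory are now inconsistent
        else:
            raise ValueError(f"unknown policy: {policy}")
    if block.dirty:
        next_level_writes += 1      # write the block back once, on eviction
    return block.data, next_level_writes


stores = [10, 20, 30]
print("write-through:", write_hits(stores, "write-through"))
print("write-back:   ", write_hits(stores, "write-back"))
```

Repeated writes to the same block cost one next-level write under write-back but one per store under write-through.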



#### Write Buffer for Write-Through Caching



- Write buffer between the cache and main memory
  - Processor: writes data into the cache and the write buffer
  - Memory controller: writes contents of the write buffer to memory
- The write buffer is just a **FIFO** 
  - Typical number of entries: 4
  - Works fine if stores arrive less frequently than memory can retire them
- Memory system designer's nightmare: write buffer saturation
  - One solution is to use a write-back cache; another is to use an L2 cache
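A write buffer of this kind can be modeled as a bounded FIFO. The sketch below is a simplified illustration: the `WriteBuffer` class and its method names are hypothetical, memory is a plain dictionary, and a full buffer is reported by returning `False` rather than by modeling pipeline stalls.

```python
from collections import deque


class WriteBuffer:
    """Bounded FIFO between a write-through cache and memory."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = deque()

    def push(self, address, data) -> bool:
        """Processor side: queue a store; False means the processor must stall."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((address, data))
        return True

    def drain_one(self, memory) -> bool:
        """Memory controller side: retire the oldest entry to memory."""
        if not self.entries:
            return False
        address, data = self.entries.popleft()
        memory[address] = data
        return True


memory = {}
buffer = WriteBuffer(capacity=4)
accepted = [buffer.push(address, address * 10) for address in range(5)]
print("accepted:", accepted)  # the fifth store finds the buffer full
buffer.drain_one(memory)      # the controller retires the oldest store
retry_ok = buffer.push(4, 40)
print("retry accepted:", retry_ok, "memory:", memory)
```

When stores arrive faster than `drain_one` is called, the buffer stays full and every store stalls, which is the saturation case described above.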

# Handling Cache Misses

- Read misses (I\$ and D\$)
  - stall the entire pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume
- Write misses (D\$ only)
  - Write allocate
    - (a) single-word block: write the word into the cache updating both the tag and data, no need to check for cache hit, no need to stall
    - (b) multi-word block: stall the pipeline, fetch the block from next level in the memory hierarchy, install it in the cache, write the word from the processor to the cache, then let the pipeline resume
  - No-write allocate: skip the cache write and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer is not full
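The two write-miss policies can be sketched for the single-word-block case. This is a simplified model that assumes a write-through cache, so the word goes to the write buffer under both policies, and it ignores the stall when the buffer is full. `handle_write_miss` is a hypothetical name.

```python
def handle_write_miss(cache, write_buffer, word_address, value,
                      num_blocks, write_allocate):
    """Handle a write miss in a direct-mapped cache with one-word blocks."""
    index = word_address % num_blocks
    tag = word_address // num_blocks
    if write_allocate:
        # The whole block is overwritten, so nothing has to be fetched first:
        # update both the tag and the data.
        cache[index] = (tag, value)
    # Assumption: write-through, so the word is queued for the next level
    # whether or not it was allocated in the cache.
    write_buffer.append((word_address, value))


allocating_cache, allocating_buffer = {}, []
handle_write_miss(allocating_cache, allocating_buffer, 5, 99,
                  num_blocks=4, write_allocate=True)

bypassing_cache, bypassing_buffer = {}, []
handle_write_miss(bypassing_cache, bypassing_buffer, 5, 99,
                  num_blocks=4, write_allocate=False)

print("write allocate:   ", allocating_cache, allocating_buffer)
print("no-write allocate:", bypassing_cache, bypassing_buffer)
```

With multiword blocks the allocate path would first have to fetch the rest of the block, which is why case (b) stalls the pipeline.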



#### Hardware debug





# ScalableCore system



#### Verilator, the fastest Verilog/SystemVerilog simulator

Welcome to Verilator, the fastest Verilog/SystemVerilog simulator.

- Accepts Verilog or SystemVerilog
- Performs lint code-quality checks
- Compiles into multithreaded C++, or SystemC
- Creates XML to front-end your own tools



#### Fast

- Outperforms many closed-source commercial simulators
- Single- and multithreaded output models

#### code001.v

```
module main ();
initial begin
    $write("hello, world\n");
    $finish();
end
endmodule
```

https://www.veripool.org/verilator/


    $ verilator --binary code001.v
    $ obj_dir/Vcode001
    hello, world
    - code001.v:8: Verilog $finish
    - S i m u l a t i o n R e p o r t: Verilator 5.026 2024-06-15
    - Verilator: $finish at 1ps; walltime 0.001 s; speed 0.000 s/s
    - Verilator: cpu 0.000 s on 1 threads; alloced 121 MB