

2011-06-28

2012年 前学期 TOKYO TECH



# 計算機アーキテクチャ 第一 (E)

## メモリ3: キャッシュ

吉瀬 謙二 計算工学専攻

kise\_at\_cs.titech.ac.jp

W641講義室 木曜日13:20 – 14:50

## Direct Mapped Cache

- Consider the main memory word reference string

Start with an empty cache - all

blocks initially marked as not valid

0 1 2 3 4 3 4 15

Tag 0 miss

|    |        |
|----|--------|
| 00 | Mem(0) |
|    |        |
|    |        |
|    |        |

1 miss

|    |        |
|----|--------|
| 00 | Mem(0) |
| 00 | Mem(1) |
|    |        |
|    |        |

2 miss

|    |        |
|----|--------|
| 00 | Mem(0) |
| 00 | Mem(1) |
| 00 | Mem(2) |
|    |        |

3 miss

|    |        |
|----|--------|
| 00 | Mem(0) |
| 00 | Mem(1) |
| 00 | Mem(2) |
| 00 | Mem(3) |

4 miss

|    |        |
|----|--------|
| 01 | Mem(4) |
| 00 | Mem(1) |
| 00 | Mem(2) |
| 00 | Mem(3) |

3 hit

|    |        |
|----|--------|
| 01 | Mem(4) |
| 00 | Mem(1) |
| 00 | Mem(2) |
| 00 | Mem(3) |

4 hit

|    |        |
|----|--------|
| 01 | Mem(4) |
| 00 | Mem(1) |
| 00 | Mem(2) |
| 00 | Mem(3) |

15 miss

|    |        |
|----|--------|
| 01 | Mem(4) |
| 00 | Mem(1) |
| 00 | Mem(2) |
| 00 | Mem(3) |

11 15

- 8 requests, 6 misses

2

**Multiword Block Direct Mapped Cache**

- Four words/block, cache size = 1K words

Address fields:

- 31 30 ... 13 12 11 ... 4 3 2 1 0
- Byte offset
- Block offset

Cache organization:

- IndexValid Tag
- Index
- Data (4 word)
- 256 blocks
- 256 entries per block
- 4 words per block

Write-back policy:

- Write-to-Block
- Block offset
- 32 bytes

## Taking Advantage of Spatial Locality



- **Read hits (I\$ and D\$)**
  - this is what we want!
- **Write hits (D\$ only)**
  - allow cache and memory to be **inconsistent**
    - write the data only into the cache block (**write-back**)
    - need a **dirty** bit for each data cache block to tell if it needs to be written back to memory when it is evicted
  - require the cache and memory to be **consistent**
    - always write the data into both the cache block and the next level in the memory hierarchy (**write-through**) so don't need a dirty bit
    - writes run at the speed of the next level in the memory hierarchy – **so slow!** – or can use a **write buffer**, so only have to stall if the write buffer is full

```

graph LR
    Processor[Processor] --> Cache[Cache]
    Cache --> DRAM[DRAM]
    Cache --> WB[write buffer]
    WB --> DRAM

```

## Sources of Cache Misses

- **Compulsory** (初期参照ミス, cold start or process migration, first reference):
  - First access to a block, “cold” fact of life, not a whole lot you can do about it
  - If you are going to run “millions” of instruction, compulsory misses are insignificant
- **Conflict** (競合性ミス, collision):
  - Multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase **associativity**
- **Capacity** (容量性ミス):
  - Cache cannot contain all blocks accessed by the program
  - Solution: increase cache size

7

## Handling Cache Misses

- **Read misses (I\$ and D\$)**
  - **stall** (ストール,立ち往生させる) the entire pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume
- **Write misses (D\$ only)**
  - **Write allocate**
    - (a) write the word into the cache updating both the tag and data, no need to check for cache hit, no need to stall
    - (b) **stall** the pipeline, fetch the block from next level in the memory hierarchy, install it in the cache, write the word from the processor to the cache, then let the pipeline resume
  - **No-write allocate** – skip the cache write and just write the word to the write buffer (and eventually to the next memory level), no need to stall if the write buffer isn’t full; must invalidate **the cache block** since it will be **inconsistent**

8

## Miss Rate vs Block Size vs Cache Size



- Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller

9

## Block Size Tradeoff

- Larger block sizes take advantage of spatial locality **but**
  - If the block size is too big relative to the cache size, the miss rate will go up
  - Larger block size means larger miss penalty
    - Latency to first word in block + transfer time for remaining words



□ In general, **Average Memory Access Time**  

$$= \text{Hit Time} + \text{Miss Penalty} \times \text{Miss Rate}$$

10

## Reducing Cache Miss Rates, **associativity**

- **Allow more flexible block placement**
  - In a **direct mapped cache** a memory block maps to exactly **one cache block**
  - At the other extreme, could allow a memory block to be mapped to any cache block – **fully associative cache**
- A compromise is to divide the cache into **sets** each of which consists of **n “ways” (n-way set associative)**. A memory block maps to a unique set and can be placed in any way of that set (so there are **n choices**)

11

## Caching: Direct mapped (First Example)

### Main Memory

Two low order bits define the byte in the word (32-b words)

Q2: How do we find it?

Use **next 2 low order memory address bits** – the **index** – to determine which cache block



12



### Benefits of Set Associative Caches

- The choice of direct mapped or set associative depends on the **cost of a miss** versus the **cost of implementation**

Miss Rate

Associativity

4KB  
8KB  
16KB  
32KB  
64KB  
128KB  
256KB  
512KB

Data from Hennessy & Patterson, Computer Architecture, 2003

- Largest gains are in going from **direct mapped** to **2-way**

19

### Reducing Cache Miss Rates by multiple levels

- Enough room on the die for **bigger L1 caches** or for a **second level of caches** – normally a **unified L2 cache** (i.e., it holds both instructions and data) and in some cases even a **unified L3 cache**
- For our example,
  - CPI<sub>ideal</sub> of 2,
  - 100 cycle miss penalty (to main memory),
  - 36% load/stores,
  - a 2% (4%) L1\$ (D\$) miss rate,
  - add a UL2\$ that has a 25 cycle miss penalty and a 0.5% miss rate**

$$CPI_{stalls} = 2 + .02 \times 25 + .36 \times .04 \times 25 + .005 \times 100 + .36 \times .005 \times 100 = 3.54$$

(as compared to 5.44 with no L2\$)

20

### Multilevel Cache Design Considerations

- Design considerations for L1 and L2 caches are very different
  - Primary cache should focus on **minimizing hit time** in support of a shorter clock cycle
  - Secondary cache should focus on **reducing miss rate** to reduce the penalty of long main memory access times
- The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache – so it can be smaller (i.e., faster) but have a higher miss rate
- For the L2 cache, hit time is less important than miss rate
  - The L2\$ hit time determines L1\$'s miss penalty

21

### Key Cache Design Parameters

|                       | L1 typical  | L2 typical      |
|-----------------------|-------------|-----------------|
| Total size (blocks)   | 250 to 2000 | 4000 to 250,000 |
| Total size (KB)       | 16 to 64    | 500 to 8000     |
| Block size (B)        | 32 to 64    | 32 to 128       |
| Miss penalty (clocks) | 10 to 25    | 100 to 1000     |
| Miss rates            | 2% to 5%    | 0.1% to 2%      |

22

### Two Machines' Cache Parameters

|                  | Intel P4                                 | AMD Opteron                  |
|------------------|------------------------------------------|------------------------------|
| L1 organization  | Split I\$ and D\$                        | Split I\$ and D\$            |
| L1 cache size    | 8KB for D\$, 96KB for trace cache (~I\$) | 64KB for each of I\$ and D\$ |
| L1 block size    | 64 bytes                                 | 64 bytes                     |
| L1 associativity | 4-way set assoc.                         | 2-way set assoc.             |
| L1 replacement   | ~ LRU                                    | LRU                          |
| L1 write policy  | write-through                            | write-back                   |
| L2 organization  | Unified                                  | Unified                      |
| L2 cache size    | 512KB                                    | 1024KB (1MB)                 |
| L2 block size    | 128 bytes                                | 64 bytes                     |
| L2 associativity | 8-way set assoc.                         | 16-way set assoc.            |
| L2 replacement   | ~ LRU                                    | ~ LRU                        |
| L2 write policy  | write-back                               | write-back                   |

23

### 先端マイクロプロセッサ Intel Montecito

- 2 個のEPICプロセッサコア
- 1MB L2, 12MB L3キャッシュ
- EPICコアは11 issue, 2way Temporal MT
- 初の10億超トランジスタ
  - 1.72BTrs
  - 21.5mm x 27.7mm
  - 90nm
  - 100W
- パワー制御用の専用チップ Foxtonを搭載

Figure 10.1.7

Source: ISSCC 2005 papers

24

