Fiscal Year 2023

Ver. 2024-01-25a

Course number: CSC.T433 School of Computing, Graduate major in Computer Science

# Advanced Computer Architecture

## 11. Thread Level Parallelism: Interconnection Network

www.arch.cs.titech.ac.jp/lecture/ACA/ Room No.W834, Lecture (Face-to-face) Mon 13:30-15:10, Thu 13:30-15:10

Kenji Kise, Department of Computer Science kise \_at\_ c.titech.ac.jp

CSC.T433 Advanced Computer Architecture, Department of Computer Science, TOKYO TECH

## Sample of a wrong parallel program using pthread

```
% gcc main1.c -O0 -lpthread -lm -o a.out1
% ./a.out1
main: 20000000

/* ---- main1.c: sequential program ---- */
#include <stdio.h>
#include <pthread.h>
#define N 10000000 // ten million

int a = 0;

int func1(){
  int i;
  for(i=0; i<N; i++){ a++; }
};

int func2(){
  int i;
  for(i=0; i<N; i++){ a++; }
};

int main(){
  func1();
  func2();
  printf("main: %d\n", a);
  return 0;
}

/* ---- main2.c: parallel program with func1 and func2 ---- */
#include <stdio.h>
#include <pthread.h>
#define N 10000000 // ten million

int a = 0;

int func1(){
  int i;
  for(i=0; i<N; i++){ a++; }
};

int func2(){
  int i;
  for(i=0; i<N; i++){ a++; }
};

int main(){
  pthread_t t1, t2;
  pthread_create(&t1, NULL, (void *)func1, NULL);
  pthread_create(&t2, NULL, (void *)func2, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("main: %d\n", a);
  return 0;
}

/* ---- main3.c: parallel program with func1,
        Single Program Multiple Data (SPMD) ---- */
#include <stdio.h>
#include <pthread.h>
#define N 10000000 // ten million

int a = 0;

int func1(){
  int i;
  for(i=0; i<N; i++){ a++; }
};

int main(){
  pthread_t t1, t2;
  pthread_create(&t1, NULL, (void *)func1, NULL);
  pthread_create(&t2, NULL, (void *)func1, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("main: %d\n", a);
  return 0;
}
```


## Sample of some parallel programs using pthread

main1.c (sequential program) and main3.c (parallel program with func1) are the same as on the previous slide; the sequential program prints `main: 20000000`. main4.c (parallel program with func1) protects the update of `a` with a mutex:

```
/* ---- main4.c: parallel program with func1 ---- */
#include <stdio.h>
#include <pthread.h>
#define N 10000000 // ten million

int a = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

int func1(){
  int i;
  for(i=0; i<N; i++){
    pthread_mutex_lock(&m);
    a++;
    pthread_mutex_unlock(&m);
  }
};

int main(){
  pthread_t t1, t2;
  pthread_create(&t1, NULL, (void *)func1, NULL);
  pthread_create(&t2, NULL, (void *)func1, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("main: %d\n", a);
  return 0;
}
```


#### Shared memory many-core architecture

- A single chip integrates many cores (conventional processors) and an interconnection network.
- The shared memory or shared address space (SAS) is used as a means for communication between the processors.



Intel Skylake-X, Core i9-7980XE, 2017



## The free lunch is over

- Programmers have to worry much about performance and concurrency
- Parallel programming & multi-processor (multi-core) architectures



#### Free Lunch

Programmers haven't really had to worry much about performance or concurrency because of Moore's Law

Why did we not see 4 GHz processors on the market?

The traditional approach to application performance was to simply wait for the next generation of processor; most software developers did not need to invest in performance tuning, and enjoyed a "free lunch" from hardware improvements.



The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software by Herb Sutter, 2005

## Parallel programming

• Several dependent threads run at the same time on a multi-processor (many-core) system.



## Four steps in creating a parallel program

- 0. Preparing an optimized sequential program (baseline)
- 1. Decomposition of computation in tasks
- 2. Assignment of tasks to processes
- 3. Orchestration of data access, communication, and synchronization
- 4. Mapping processes to processors (cores)




## Simulating ocean currents





(a) Cross sections

(b) Spatial discretization of a cross section

- Model as two-dimensional grids
  - Discretize in space and time
  - finer spatial and temporal resolution enables greater accuracy
- Many different computations per time step
  - Concurrency across and within grid computations
- We use one-dimensional grids for simplicity

## Sequential version as the baseline

- A sequential program main5.c and the execution result
- Computations in blue color are fully parallel

```
#include <stdio.h>
#include <math.h>   /* fabsf */

#define N 8         /* the number of grids */
#define TOL 15.0    /* tolerance parameter */
float A[N+2], B[N+2];

void solve () {
    int i, done = 0;
    while (!done) {
        float diff = 0.0;
        for (i=1; i<=N; i++) {
            B[i] = 0.333 * (A[i-1] + A[i] + A[i+1]);
            diff = diff + fabsf(B[i] - A[i]);
        }
        if (diff <TOL) done = 1;
        for (i=1; i<=N; i++) A[i] = B[i];
        for (i=0; i<=N+1; i++) printf("%6.2f ", B[i]);
        printf("| diff=%6.2f\n", diff); /* for debug */
    }
}

int main() {
    int i;
    for (i=1; i<N-1; i++) A[i] = 100+i*i;
    for (i=0; i<=N+1; i++) printf("%6.2f ", A[i]);
    printf("\n");
    solve();
}
```

| 0.00 | 101.00 | 104.00 | 109.00 | 116.00 | 125.00 | 136.00 | 0.00  | 0.00  | 0.00 |             |
|------|--------|--------|--------|--------|--------|--------|-------|-------|------|-------------|
| 0.00 | 68.26  | 104.56 | 109.56 | 116.55 | 125.54 | 86.91  | 45.29 | 0.00  | 0.00 | diff=129.32 |
| 0.00 | 57.55  | 94.03  | 110.11 | 117.10 | 109.56 | 85.83  | 44.02 | 15.08 | 0.00 | diff= 55.76 |
| 0.00 | 50.48  | 87.15  | 106.97 | 112.14 | 104.06 | 79.72  | 48.26 | 19.68 | 0.00 | diff= 42.50 |
| 0.00 | 45.83  | 81.45  | 101.99 | 107.62 | 98.54  | 77.27  | 49.17 | 22.63 | 0.00 | diff= 31.68 |
| 0.00 | 42.38  | 76.35  | 96.92  | 102.61 | 94.38  | 74.92  | 49.64 | 23.91 | 0.00 | diff= 26.88 |
| 0.00 | 39.54  | 71.81  | 91.87  | 97.87  | 90.55  | 72.91  | 49.44 | 24.49 | 0.00 | diff= 23.80 |
| 0.00 | 37.08  | 67.67  | 87.10  | 93.34  | 87.02  | 70.89  | 48.90 | 24.62 | 0.00 | diff= 22.12 |
| 0.00 | 34.88  | 63.89  | 82.62  | 89.06  | 83.67  | 68.87  | 48.09 | 24.48 | 0.00 | diff= 21.06 |
| 0.00 | 32.89  | 60.40  | 78.44  | 85.03  | 80.45  | 66.81  | 47.10 | 24.17 | 0.00 | diff= 20.26 |
| 0.00 | 31.07  | 57.19  | 74.55  | 81.23  | 77.35  | 64.72  | 45.98 | 23.73 | 0.00 | diff= 19.47 |
| 0.00 | 29.39  | 54.21  | 70.92  | 77.63  | 74.36  | 62.62  | 44.77 | 23.21 | 0.00 | diff= 18.70 |
| 0.00 | 27.84  | 51.46  | 67.52  | 74.23  | 71.47  | 60.52  | 43.49 | 22.64 | 0.00 | diff= 17.95 |
| 0.00 | 26.41  | 48.89  | 64.34  | 71.00  | 68.67  | 58.43  | 42.17 | 22.02 | 0.00 | diff= 17.23 |
| 0.00 | 25.07  | 46.50  | 61.35  | 67.94  | 65.97  | 56.37  | 40.84 | 21.38 | 0.00 | diff= 16.53 |
| 0.00 | 23.83  | 44.26  | 58.54  | 65.02  | 63.36  | 54.34  | 39.49 | 20.72 | 0.00 | diff= 15.85 |
| 0.00 | 22.68  | 42.17  | 55.88  | 62.24  | 60.85  | 52.34  | 38.14 | 20.05 | 0.00 | diff= 15.20 |
| 0.00 | 21.59  | 40.20  | 53.38  | 59.60  | 58.42  | 50.39  | 36.81 | 19.38 | 0.00 | diff= 14.58 |






### Decomposition and assignment

- Single Program Multiple Data (SPMD)
  - Decomposition: there are eight tasks to compute B[]
  - Assignment: the first four tasks for core 1, and the last four tasks for core 2
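
The block assignment above can be sketched in C as follows. This is a minimal illustration, not code from the lecture: the names `struct range` and `my_range` are hypothetical, cores are numbered from 0 (so "core 1" and "core 2" on the slide correspond to `pid` 0 and 1), and the number of tasks is assumed to be a multiple of the number of cores.

```c
#include <assert.h>

/* Inclusive index range [min, max] of the grid points owned by one core. */
struct range { int min, max; };

/* Block assignment: n tasks (indices 1..n) split evenly over ncores cores.
 * Assumption of this sketch: n is a multiple of ncores. */
struct range my_range(int pid, int n, int ncores) {
    assert(ncores > 0 && n % ncores == 0);
    assert(pid >= 0 && pid < ncores);
    struct range r;
    r.min = 1 + pid * (n / ncores);   /* first task of this core */
    r.max = r.min + n / ncores - 1;   /* last task of this core  */
    return r;
}
```

With eight tasks and two cores, `my_range(0, 8, 2)` covers B[1] to B[4] and `my_range(1, 8, 2)` covers B[5] to B[8].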




### Orchestration

- LOCK and UNLOCK around a critical section
  - A lock provides exclusive access to the locked data.
  - The critical section is the set of operations we want to execute atomically. Here `diff = diff + mydiff` consists of (1) load diff, (2) add, and (3) store diff, and these operations must be executed atomically.
- BARRIER ensures that all cores reach this point.
  - The `if` statement must be executed after all cores have updated `diff`, so it is placed after the barrier, not inside the critical section.

```
float A[N+2], B[N+2]; /* these are in shared memory */
float diff=0.0;       /* variable in shared memory */
int ncores = 2;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t barrier;

void solve_pp (int pid) {
    int i, done = 0;                   /* private variables */
    int mymin = 1 + (pid * N/ncores);  /* private variable */
    int mymax = mymin + N/ncores - 1;  /* private variable */
    while (!done) {
        float mydiff = 0;
        for (i=mymin; i<=mymax; i++) {
            B[i] = 0.333 * (A[i-1] + A[i] + A[i+1]);
            mydiff = mydiff + fabsf(B[i] - A[i]);
        }
        pthread_mutex_lock(&m);
        diff = diff + mydiff;          /* critical section */
        pthread_mutex_unlock(&m);
        pthread_barrier_wait(&barrier);
        if (diff <TOL) done = 1;       /* after all cores update diff */
        pthread_barrier_wait(&barrier);
        if (pid==1) diff = 0.0;
        for (i=mymin; i<=mymax; i++) A[i] = B[i];
        pthread_barrier_wait(&barrier);
    }
}
```

## Parallel program after orchestration

% gcc main6.c -O0 -lpthread -lm -o a.out6

```
#include <stdio.h>
#include <math.h>
#include <pthread.h>

#define N 8            /* the number of grids */
#define TOL 15.0       /* tolerance parameter */
float A[N+2], B[N+2];  /* these are in shared memory */
float diff=0.0;        /* variable in shared memory */
int ncores = 2;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t barrier;

void solve_pp (void *p) {
    int pid = *(int *)p;
    int i, done = 0;                   /* private variables */
    int mymin = 1 + (pid * N/ncores);  /* private variable */
    int mymax = mymin + N/ncores - 1;  /* private variable */
    while (!done) {
        float mydiff = 0.0;
        for (i=mymin; i<=mymax; i++) {
            B[i] = 0.333 * (A[i-1] + A[i] + A[i+1]);
            mydiff = mydiff + fabsf(B[i] - A[i]);
        }
        pthread_mutex_lock(&m);
        diff = diff + mydiff;
        pthread_mutex_unlock(&m);
        pthread_barrier_wait(&barrier);
        if (diff <TOL) done = 1;
        pthread_barrier_wait(&barrier);
        if (pid==1) diff = 0.0;
        for (i=mymin; i<=mymax; i++) A[i] = B[i];
        pthread_barrier_wait(&barrier);
    }
}

int main(){
    pthread_t t1, t2;
    int pid0 = 0;
    int pid1 = 1;
    for (int i=1; i<N-1; i++) A[i] = 100+i*i;
    pthread_barrier_init(&barrier, NULL, ncores);
    pthread_create(&t1, NULL, (void *)solve_pp, (void*)&pid0);
    pthread_create(&t2, NULL, (void *)solve_pp, (void*)&pid1);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    for (int i=0; i<=N+1; i++) printf("%6.2f ", B[i]);
    printf("\n");
    return 0;
}
```

main6.c: parallel program

## Key components of many-core processors

- Interconnection network
  - connecting many modules on a chip achieving high throughput and low latency
- Main memory and caches
  - Caches are used to reduce latency and to lower network traffic
  - A parallel program has private data and shared data
  - New issues are cache coherence and memory consistency

- Core
  - High-performance superscalar processor providing a hardware mechanism to support thread synchronization




## Performance metrics of interconnection network

- Network cost
  - number of links on a switch to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
- Network bandwidth (NB)
  - represents the best case
  - bandwidth of each link x number of links
- Bisection bandwidth (BB)
  - represents the worst case
  - divide the machine in two parts, each with half the nodes and sum the bandwidth of the links that cross the dividing line
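
The two bandwidth metrics above can be made concrete by describing a network as a list of links and counting them. The following is a minimal sketch under stated assumptions: `struct link`, `network_bandwidth`, and `cut_bandwidth` are hypothetical names, all links have the same bandwidth, and only one dividing line (nodes 0..n/2-1 versus the rest) is evaluated, whereas the true bisection bandwidth is the minimum over all balanced cuts.

```c
#include <assert.h>

/* A bidirectional link between two nodes, numbered 0..n-1. */
struct link { int a, b; };

/* Network bandwidth (best case): every link carries traffic at once. */
double network_bandwidth(int nlinks, double link_bw) {
    assert(nlinks >= 0);
    return nlinks * link_bw;
}

/* Bandwidth across one particular cut: nodes 0..n/2-1 on one side,
 * nodes n/2..n-1 on the other. Sums the links that cross the line. */
double cut_bandwidth(const struct link *links, int nlinks, int n,
                     double link_bw) {
    assert(n > 0 && n % 2 == 0);
    int crossing = 0;
    for (int i = 0; i < nlinks; i++) {
        int a_left = links[i].a < n / 2;
        int b_left = links[i].b < n / 2;
        if (a_left != b_left) crossing++;   /* link crosses the cut */
    }
    return crossing * link_bw;
}
```

For example, eight nodes connected in a cycle by the links (i, (i+1) mod 8) give a network bandwidth of 8 link bandwidths, while only 2 links cross this cut.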



#### **Bus Network**

- N cores, N switches, 1 link (the bus)
- Only 1 transfer at a time
  - NB (best case) = link (bus) bandwidth x 1
  - BB (worst case) = link (bus) bandwidth x 1
- All processors can snoop the bus

Figure: the case where core B sends a packet to another core.


## Exercise 1

- Bus Network with multiplexer (mux)
- one N-input mux for N cores
- Draw the bus network organization of 4 cores using a 4-input mux.

## **Ring Network**

- N cores, N switches, 2 links/switch, N links
- N simultaneous transfers
  - NB (best case) = link bandwidth x N
  - BB (worst case) = link bandwidth x 2
- If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case



## Cell Broadband Engine (2005)

- Cell Broadband Engine (2005)
  - 8 core (SPE) + 1 core (PPE)
    - each SPE has 256KB memory
  - PS3, IBM Roadrunner (12k cores)



PlayStation3 from PlayStation.com (Japan)



IEEE Micro, Cell Multiprocessor Communication Network: Built for Speed



Diagram created by IBM to promote the CBEP, ©2005 from WIKIPEDIA

## Intel Xeon Phi (2012)



#### Table 2. Intel® Xeon Phi™ Product Family Specifications

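| Product number | Form factor & thermal solution | Board TDP (watts) | Number of cores | Frequency (GHz) | Peak double precision performance (GFLOP) | Peak memory bandwidth (GB/s) | Memory capacity (GB) | Intel Turbo Boost Technology |
|----------------|--------------------------------|-------------------|-----------------|-----------------|-------------------------------------------|------------------------------|----------------------|------------------------------|
| 3120P          | PCIe, Passive                  | 300               | 57              | 1.1             | 1003                                      | 240                          | 6                    | N/A                          |
| 3120A          | PCIe, Active                   | 300               | 57              | 1.1             | 1003                                      | 240                          | 6                    | N/A                          |
| 5110P          | PCIe, Passive                  | 225               | 60              | 1.053           | 1011                                      | 320                          | 8                    | N/A                          |
| 5120D          | Dense form factor, None        | 245               | 60              | 1.053           | 1011                                      | 352                          | 8                    | N/A                          |
| 7110P          | PCIe, Passive                  | 300               | 61              | 1.238           | 1208                                      | 352                          | 16                   | Peak turbo frequency: 1.33 GHz |
| 7120X          | PCIe, None                     | 300               | 61              | 1.238           | 1208                                      | 352                          | 16                   | Peak turbo frequency: 1.33 GHz |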




Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Block Diagram



## Fat Tree (1)

- Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.
- Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
  - The bisection bandwidth of a tree is horrible: 1 link, at all times.
- The solution is to 'thicken' the upper links.
  - More links as the tree gets thicker increases the bisection bandwidth



## Fat Tree

- N cores, log(N-1) x logN switches, 2 up + 4 down = 6 links/switch, N x logN links
- N simultaneous transfers
  - NB = link bandwidth x N log N
  - BB = link bandwidth x 4



## Crossbar (Xbar) Network

- N cores, N<sup>2</sup> switches (unidirectional), 2 links/switch, N<sup>2</sup> links
- N simultaneous transfers
  - NB = link bandwidth x N (best case)
  - BB = link bandwidth x N (worst case)






#### Crossbar (Xbar) Network with mux

• N N-input multiplexers



#### Mesh Network

- N cores, N switches, 5 links/switch
- N simultaneous transfers
  - NB = link bandwidth x N (best case)
  - BB = link bandwidth x N<sup>1/2</sup> (worst case)
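
The NB and BB figures quoted so far for the bus, ring, crossbar, and mesh can be collected into two small functions for side-by-side comparison. This is a sketch, not part of the lecture material: the enum and function names are hypothetical, results are in units of one link's bandwidth, and the mesh is assumed to be square (N is a perfect square).

```c
#include <assert.h>

enum topology { BUS, RING, CROSSBAR, MESH };

/* Side length k of a square mesh with n = k * k cores. */
static int mesh_side(int n) {
    int k = 1;
    while (k * k < n) k++;
    assert(k * k == n);   /* assumption: n is a perfect square */
    return k;
}

/* Best-case network bandwidth (NB), in link bandwidths. */
int nb_links(enum topology t, int n) {
    assert(n > 0);
    return t == BUS ? 1 : n;
}

/* Worst-case bisection bandwidth (BB), in link bandwidths. */
int bb_links(enum topology t, int n) {
    assert(n > 0);
    switch (t) {
    case BUS:      return 1;
    case RING:     return 2;
    case CROSSBAR: return n;
    case MESH:     return mesh_side(n);   /* N^(1/2) */
    }
    return 0;
}
```

For N = 64 this gives NB/BB of 1/1 for the bus, 64/2 for the ring, 64/64 for the crossbar, and 64/8 for the mesh.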



#### 2D and 3D Mesh / Torus Network





## Intel Single-Chip Cloud Computer (2009)

• To research multi-core processors and parallel processing.



#### Epiphany-V: A 1024 core 64-bit RISC SoC (2016)

Summary of Epiphany-V features:

- 1024 64-bit RISC processors
- 64-bit memory architecture
- 64/32-bit IEEE floating point support
- 64MB of distributed on-chip memory
- 1024 programmable I/O signals
- Three 136-bit wide 2D mesh NOCs
- 2052 Independent Power Domains
- Support for up to 1 billion shared memory processors
- Binary compatibility with Epiphany III/IV chips
- Custom ISA extensions for deep learning, communication, and cryptography

| Function           | Value (mm<sup>2</sup>) | Share of Total Die Area |
|--------------------|------------------------|-------------------------|
| SRAM               | 62.4                   | 53.3%                   |
| Register File      | 15.1                   | 12.9%                   |
| FPU                | 11.8                   | 10.1%                   |
| NOC                | 12.1                   | 10.3%                   |
| IO Logic           | 6.5                    | 5.6%                    |
| "Other" Core Stuff | 5.1                    | 4.4%                    |
| IO Pads            | 3.9                    | 3.3%                    |
| Always on Logic    | 0.66                   | 0.6%                    |

Table 5: Epiphany-V Area Breakdown

## Intel Skylake-X, Core i9-7980XE (2017)

- 18 cores
- 2D mesh topology





## Intel Xeon Scalable Processor


#### New Mesh Interconnect Architecture

#### Broadwell EX 24-core die



#### Skylake-SP 28-core die

| 2x UPI x 20 | PCIe* x16  | PCIe x16<br>DMI x 4<br>CRDMA | On Pkg<br>PCIe x16 | 1x UPI x 20 | PCIe x16       |  |
|-------------|------------|------------------------------|--------------------|-------------|----------------|--|
| CHA/SF/LLC  | CHA/SF/LLC | CHA/SF/LLC                   | CHA/SF/LLC         | CHA/SF/LLC  | CHA/SF/LLC     |  |
| SKX Core    | SKX Core   | SKX Core                     | SKX Core           | SKX Core    | SKX Core       |  |
| DDR4 NC     | CHA/SF/LLC | CHA/SF/LLC                   | CHA/SF/LLC         | CHA/SF/LLC  | MC DDR4        |  |
| DDR 4       | SKX Core   | SKX Core                     | SKX Core           | SKX Core    | DDR 4<br>DDR 4 |  |
| CHA/SF/LLC  | CHA/SF/LLC | CHA/SF/LLC                   | CHA/SF/LLC         | CHA/SF/LLC  | CHA/SF/LLC     |  |
| SKX Core    | SKX Core   | SKX Core                     | SKX Core           | SKX Core    | SKX Core       |  |
| CHA/SF/LLC  | CHA/SF/LLC | CHA/SF/LLC                   | CHA/SF/LLC         | CHA/SF/LLC  | CHA/SF/LLC     |  |
| SKX Core    | SKX Core   | SKX Core                     | SKX Core           | SKX Core    | SKX Core       |  |
| CHA/SF/LLC  | CHA/SF/LLC | CHA/SF/LLC                   | CHA/SF/LLC         | CHA/SF/LLC  | CHA/SF/LLC     |  |
| SKX Core    | SKX Core   | SKX Core                     | SKX Core           | SKX Core    | SKX Core       |  |

CHA – Caching and Home Agent ; SF – Snoop Filter; LLC – Last Level Cache; SKX Core – Skylake Server Core; UPI – Intel® UltraPath Interconnect

#### **MESH IMPROVES SCALABILITY WITH HIGHER BANDWIDTH AND REDUCED LATENCIES**

Intel Press Workshops - June 2017




## Bus vs. Networks on Chip (NoC) of mesh topology






## Typical NoC architecture of mesh topology

- NoC requirements: low latency, high throughput, low cost
- Packet-based data transmission via NoC routers and XY dimension-order routing



## Packet organization (Flit encoding)

- A flit (flow control unit or flow control digit) is a link-level atomic piece that forms a network packet.
  - A packet has one head flit and some body flits.
- For simplicity, assume that a packet has only one flit.
  - Later we see a packet which has some flits.
- Each flit typically has three fields:
  - Payload (data)
  - Route information
  - Virtual channel identifier (VC)

Flit layout: Route info, VC, Payload





Packet (tag + data)

## Packet organization (Flit encoding)

- A flit (flow control unit or flow control digit) is a link-level atomic piece that forms a network packet.
  - A packet has one head flit and some body flits.
- Each flit typically has three fields:
  - payload(data) or route information(tag)
  - flit type : head, body, tail, etc.
  - virtual channel identifier
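
One way to picture the three fields listed above is to pack them into a single machine word. The layout below is purely hypothetical (the slides do not give field widths): a 32-bit flit with a 2-bit flit type, a 2-bit virtual channel identifier, and 28 bits that carry either payload or route information. The function names are illustrative as well.

```c
#include <assert.h>
#include <stdint.h>

enum flit_type { FLIT_HEAD = 0, FLIT_BODY = 1, FLIT_TAIL = 2 };

/* Hypothetical layout: [31:30] type, [29:28] VC, [27:0] payload or route. */
#define DATA_BITS 28
#define DATA_MASK ((UINT32_C(1) << DATA_BITS) - 1)

/* Pack the three fields into one 32-bit flit. */
uint32_t flit_pack(enum flit_type type, unsigned vc, uint32_t data) {
    assert(vc < 4 && data <= DATA_MASK);
    return ((uint32_t)type << 30) | ((uint32_t)vc << 28) | data;
}

/* Field extractors. */
enum flit_type flit_type_of(uint32_t f) { return (enum flit_type)(f >> 30); }
unsigned flit_vc_of(uint32_t f)         { return (f >> 28) & 3u; }
uint32_t flit_data_of(uint32_t f)       { return f & DATA_MASK; }
```

A head flit would carry route information in the data field, and the body flits that follow would carry payload; the router only needs the type and VC bits to tell them apart.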



Packet (tag + data)



• XY dimension order routing (DOR), and YX DOR





## Simple NoC router architecture

• Routing computation for XY-dimension order
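
The routing computation for XY dimension-order routing can be sketched as a pure function of the current router's coordinates and the destination's coordinates. The port names and the coordinate convention (x grows eastward, y grows northward) are assumptions of this sketch, not taken from the slides.

```c
#include <assert.h>

enum port { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH };

/* XY dimension-order routing: move along X until the column matches,
 * then along Y; deliver to the local core on arrival. */
enum port xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
    assert(cur_x >= 0 && cur_y >= 0 && dst_x >= 0 && dst_y >= 0);
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_NORTH;
    if (dst_y < cur_y) return PORT_SOUTH;
    return PORT_LOCAL;
}
```

YX routing is the same function with the two pairs of comparisons swapped. Because every packet finishes its X movement before turning into Y, the turns that could close a cycle of waiting packets never occur, which is why dimension-order routing is deadlock-free on a mesh.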





## Simple NoC router architecture

• Buffering and arbitration

NoC router

time stamp based, round robin, etc.
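
As an illustration of the round-robin option, the sketch below picks one of several input ports that request the same output. It is a minimal software model, not the router's actual logic: `rr_arbitrate` is a hypothetical name, and five ports are assumed to match the 5 links per switch of the mesh.

```c
#include <assert.h>

#define NPORTS 5   /* assumption: 4 neighbours + 1 local port */

/* Round-robin arbitration: request[p] is nonzero if input port p wants
 * the output; last is the port granted most recently. The search starts
 * just after last, so the previous winner has the lowest priority.
 * Returns the granted port, or -1 if nobody is requesting. */
int rr_arbitrate(const int request[NPORTS], int last) {
    assert(last >= 0 && last < NPORTS);
    for (int k = 1; k <= NPORTS; k++) {
        int p = (last + k) % NPORTS;
        if (request[p]) return p;
    }
    return -1;
}
```

With ports 0 and 2 requesting, a call with `last = 0` grants port 2 and the following call with `last = 2` grants port 0, so the two requesters alternate.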







## Simple NoC router architecture

- Flow control (back pressure)
  - When the destination router's input buffer is full, the packet cannot be sent.
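
Back pressure can be modelled with a bounded FIFO on the receiving side: the sender checks for a free slot and holds the packet when there is none. The buffer depth and all names below are hypothetical, and packets are reduced to plain integers for brevity.

```c
#include <assert.h>

#define BUF_DEPTH 4   /* hypothetical input buffer depth */

/* Input buffer of the destination router (circular FIFO). */
struct in_buffer { int slot[BUF_DEPTH]; int head, count; };

int buf_full(const struct in_buffer *b) { return b->count == BUF_DEPTH; }

/* Upstream router tries to forward a packet.
 * Returns 1 if accepted, 0 if it must hold the packet (back pressure). */
int try_send(struct in_buffer *dst, int packet) {
    assert(dst->count >= 0 && dst->count <= BUF_DEPTH);
    if (buf_full(dst)) return 0;
    dst->slot[(dst->head + dst->count) % BUF_DEPTH] = packet;
    dst->count++;
    return 1;
}

/* Destination router forwards its oldest packet, freeing one slot.
 * Returns 1 and stores the packet in *out, or 0 if the buffer is empty. */
int buf_pop(struct in_buffer *b, int *out) {
    if (b->count == 0) return 0;
    *out = b->slot[b->head];
    b->head = (b->head + 1) % BUF_DEPTH;
    b->count--;
    return 1;
}
```

After four successful sends the fifth `try_send` returns 0; it succeeds again only once `buf_pop` has drained a packet downstream.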




## Simple NoC router architecture

- Problem: Head-of-line (HOL) blocking
  - The first (head) packet in the same buffer blocks the movement of subsequent packets.




## Two (physical) networks to mitigate HOL?





## Datapath of Virtual Channel (VC) NoC router

• To mitigate head-of-line (HOL) blocking, virtual channels are used
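
The effect of virtual channels can be sketched as follows: each input port keeps one FIFO per VC, and the router may serve any VC whose head packet can move. The number of VCs, the fixed-priority scan, and the names are assumptions of this sketch; a real router would typically arbitrate among the eligible VCs more fairly.

```c
#include <assert.h>

#define NVC  2   /* hypothetical number of virtual channels per input port */
#define NOUT 5   /* output ports of a mesh router */

/* head_out[v]: output port wanted by the head packet of VC v,
 *              or -1 if that VC is empty.
 * out_busy[o]: nonzero if output o cannot accept a packet this cycle.
 * Returns a VC whose head packet can advance, or -1 if none can. */
int select_vc(const int head_out[NVC], const int out_busy[NOUT]) {
    for (int v = 0; v < NVC; v++) {
        int o = head_out[v];
        if (o < 0) continue;          /* empty VC */
        assert(o < NOUT);
        if (!out_busy[o]) return v;   /* this head is not blocked */
    }
    return -1;
}
```

If the head of VC 0 wants a busy output while the head of VC 1 wants a free one, `select_vc` returns 1, so the second packet proceeds. With a single FIFO it would have been stuck behind the blocked head.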



## Bus vs. Networks on Chip (NoC) of mesh topology

#### To mitigate head-of-line (HOL) blocking

Virtual Channel

#### Pipelining the NoC router microarchitecture




## Bus vs. Networks on Chip (NoC) of mesh topology



Packet (tag + data)





#### Distributed system



#### Average packet latency of mesh NoCs

- 5 stage router pipeline
- Uniform traffic (destination nodes are selected randomly)
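
A rough feel for how latency grows with mesh size under these two conditions comes from a zero-load estimate. The sketch below is an assumption-laden model, not the evaluation method of the cited paper: it enumerates all source and destination pairs of a k x k mesh (uniform traffic, source different from destination), averages the hop count of a minimal route, and charges the router pipeline depth once per router traversed (hops + 1). Link delay and queueing are ignored, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Average number of hops between distinct nodes of a k x k mesh
 * under uniform traffic, using minimal (Manhattan-distance) routes. */
double avg_hops(int k) {
    assert(k >= 2);
    long total = 0, pairs = 0;
    int n = k * k;
    for (int s = 0; s < n; s++) {
        for (int d = 0; d < n; d++) {
            if (s == d) continue;
            total += abs(s % k - d % k) + abs(s / k - d / k);
            pairs++;
        }
    }
    return (double)total / (double)pairs;
}

/* Zero-load latency estimate in cycles: a packet passes through
 * (hops + 1) routers, each costing `stages` pipeline cycles. */
double zero_load_latency(int k, int stages) {
    assert(stages > 0);
    return (avg_hops(k) + 1.0) * stages;
}
```

For a 2 x 2 mesh the average distance is 4/3 hops, so a 5-stage router pipeline gives roughly 11.7 cycles in this model; measured latencies are higher once contention sets in.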



Thiem Van Chu, Myeonggu Kang, Shi FA and Kenji Kise: Enhanced Long Edge First Routing Algorithm and Evaluation in Large-Scale Networks-on-Chip, IEEE 11th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, (September 2017).


#### Bus Network with multiplexer (mux)

• one N-input multiplexer for N cores

