EC Computer Systems. Part 10: Cache Memory

EC33 - Computer Systems, Part 10: Cache Memory
Department of Electrical Engineering, Institut Teknologi Bandung, 25

Topics
- Cache memory organization
- Direct-mapped cache
- Set associative cache
- Impact of caches on computer performance
Cache Memory 1-2

Cache Memory
- Cache memory is a small, fast, SRAM-based memory that is managed automatically by hardware.
- It holds frequently accessed blocks of main memory.
- On an access, the CPU looks for the data first in L1, then in L2, and finally in main memory.
[Figure: CPU chip with register file, ALU, L1 cache and bus interface; L2 cache on the cache bus; system bus, I/O bridge and memory bus leading to main memory]

L1 Cache
- The CPU register file has room for four 4-byte words.
- The unit of transfer between the register file and the cache is a 4-byte block.
- The L1 cache has room for two 4-word blocks (one per cache line).
- The unit of transfer between the cache and main memory is a 4-word block.
- Main memory has room for many 4-word blocks.
[Figure: register file, two-line L1 cache, and main memory with example blocks holding a b c d, p q r s, and w x y z]

Cache Memory Organization
- A cache is an array of sets. Each set contains one or more lines. Each line holds one block of data.
- Each line has 1 valid bit, t tag bits, and a cache block of B = 2^b bytes (bytes 0 to B-1).
- There are E lines per set and S = 2^s sets (set 0 to set S-1).
- Cache size: C = B x E x S data bytes.
[Figure: sets 0 to S-1, each with E lines; every line shows a valid bit, a tag, and block bytes 0 to B-1]
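As a quick check on the size formula, the following C sketch computes C = B x E x S from the exponents s and b and the associativity E. The function name cache_size_bytes is made up for this illustration and is not part of the course material.

```c
#include <stddef.h>

/* Hypothetical helper: data capacity of a cache with S = 2^s sets,
 * E lines per set and B = 2^b bytes per block. Valid and tag bits
 * are not counted, matching C = B x E x S. */
size_t cache_size_bytes(unsigned s, unsigned E, unsigned b)
{
    size_t S = (size_t)1 << s;   /* number of sets */
    size_t B = (size_t)1 << b;   /* bytes per block */
    return B * E * S;
}
```

For example, s = 7, E = 4 and b = 5 describe 128 sets of four 32-byte lines, which is 16384 data bytes (16 KB).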

Cache Addressing
- An m-bit address A (bits m-1 down to 0) is split into three fields: t tag bits, s set index bits, and b block offset bits.
- The word at address A is in the cache if the set selected by the set index contains a valid line whose tag matches the tag bits of A.
- The contents of the word begin at the byte given by the block offset, counted from the start of the block.
[Figure: address fields tag, set index and block offset next to sets 0 to S-1 with their valid bits, tags and blocks]
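The field split described above can be sketched in C with shifts and masks. The struct and function names below are assumptions made for this illustration; only the field layout (tag, then set index, then block offset) comes from the slide.

```c
#include <stdint.h>

/* Hypothetical container for the three address fields. */
struct addr_fields {
    uint64_t tag;
    uint64_t set_index;
    uint64_t block_offset;
};

/* Split addr into tag, set index and block offset for a cache with
 * S = 2^s sets and B = 2^b bytes per block. Assumes s + b < 64. */
struct addr_fields split_address(uint64_t addr, unsigned s, unsigned b)
{
    struct addr_fields f;
    f.block_offset = addr & (((uint64_t)1 << b) - 1);        /* low b bits */
    f.set_index    = (addr >> b) & (((uint64_t)1 << s) - 1); /* next s bits */
    f.tag          = addr >> (s + b);                        /* remaining bits */
    return f;
}
```

With s = 2 and b = 1, address 13 (binary 1101) splits into tag 1, set index 2 and block offset 1.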

Direct-Mapped Cache
- The simplest kind of cache.
- Each set has exactly one line (E = 1 line per set).
[Figure: sets 0 to S-1, each with a single line made up of a valid bit, a tag and a cache block]

Accessing a Direct-Mapped Cache: Set Selection
- Use the set index bits of the address to select the set.
[Figure: address fields tag (t bits), set index (s bits) and block offset (b bits); the set index picks one of sets 0 to S-1]

Accessing a Direct-Mapped Cache: Line Matching and Word Selection
- Line matching: find a valid line in the selected set whose tag matches the tag bits of the address.
- Word selection: then extract the word.
(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the address.
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte.
[Figure: selected set (i) holding a valid line with an 8-byte block (bytes 0 to 7) that contains the word bytes w0 w1 w2 w3; the address tag is compared with the line tag]

Direct-Mapped Cache Simulation
- M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set.
- Address bits: t = 1 (tag), s = 2 (set index), b = 1 (block offset).
- Address trace (reads, binary in brackets): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000].
(1) 0 [0000]: miss. Set 0 loads tag 0 with M[0-1].
(2) 1 [0001]: hit in set 0.
(3) 13 [1101]: miss. Set 2 loads tag 1 with M[12-13].
(4) 8 [1000]: miss. Set 0 is replaced by tag 1 with M[8-9].
(5) 0 [0000]: miss. Set 0 is replaced by tag 0 with M[0-1]; set 2 still holds M[12-13].
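The trace can be replayed with a very small simulator. This is a minimal sketch that hard-codes the slide's geometry (s = 2, b = 1, E = 1); the names dm_line, dm_access and dm_count_hits are hypothetical, and only tags are tracked, not the data bytes.

```c
#include <stddef.h>

enum { DM_S_BITS = 2, DM_B_BITS = 1, DM_SETS = 1 << DM_S_BITS };

struct dm_line {
    int valid;
    unsigned tag;
};

/* Look up addr in a direct-mapped cache. On a miss, load the block
 * into its set, replacing whatever was there. Returns 1 on a hit. */
int dm_access(struct dm_line cache[DM_SETS], unsigned addr)
{
    unsigned set = (addr >> DM_B_BITS) & (DM_SETS - 1);
    unsigned tag = addr >> (DM_S_BITS + DM_B_BITS);

    if (cache[set].valid && cache[set].tag == tag)
        return 1;
    cache[set].valid = 1;
    cache[set].tag = tag;
    return 0;
}

/* Run a read trace against a cold cache and return the number of hits. */
int dm_count_hits(const unsigned *trace, size_t n)
{
    struct dm_line cache[DM_SETS] = {{0, 0}};
    int hits = 0;

    for (size_t i = 0; i < n; i++)
        hits += dm_access(cache, trace[i]);
    return hits;
}
```

Feeding it the trace 0, 1, 13, 8, 0 yields a single hit (the read of address 1), since addresses 0 and 8 keep evicting each other from set 0.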

Middle Bits as the Index
- High-order bit indexing: adjacent memory lines map to the same cache location, so spatial locality is exploited poorly.
- Middle-order bit indexing: consecutive memory lines map to different cache lines, so the cache can hold a contiguous run of bytes at one time.
[Figure: a 4-line cache next to the 16 memory lines 0000 to 1111, marked once by their high-order bits and once by their middle-order bits]
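The difference between the two indexing schemes is easy to see in code. The sketch below assumes a toy machine with m = 6 address bits, s = 2 index bits and b = 2 offset bits; these values and the function names are chosen only for illustration.

```c
/* Index taken from the s highest bits of an m-bit address. */
unsigned index_high(unsigned addr, unsigned m, unsigned s)
{
    return addr >> (m - s);
}

/* Index taken from the s bits just above the b block-offset bits. */
unsigned index_middle(unsigned addr, unsigned s, unsigned b)
{
    return (addr >> b) & ((1u << s) - 1);
}
```

For the four consecutive blocks starting at addresses 0, 4, 8 and 12, index_high returns 0 every time (all four compete for one set), while index_middle returns 0, 1, 2 and 3 (all four can be cached together).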

Set Associative Cache
- Each set has more than one line (here E = 2 lines per set).
[Figure: sets 0 to S-1, each with two lines; every line has a valid bit, a tag and a cache block]

Accessing a Set Associative Cache: Set Selection
- Set selection works the same way as in a direct-mapped cache.
[Figure: address fields tag (t bits), set index (s bits) and block offset (b bits); the set index picks one of sets 0 to S-1, each holding two lines]

Accessing a Set Associative Cache: Line Matching and Word Selection
- The tag of every valid line in the selected set must be compared with the tag bits of the address.
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address.
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte.
[Figure: selected set (i) with two valid lines; the line whose tag matches holds the word bytes w0 w1 w2 w3 in an 8-byte block (bytes 0 to 7)]
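Conditions (1) and (2) amount to a search over the E lines of the selected set. The following C sketch models that search sequentially, whereas real hardware compares all tags in parallel. The type and function names are hypothetical, and replacement on a miss is left out because the slides do not cover a replacement policy.

```c
#include <stdint.h>

struct cache_line {
    int valid;
    uint64_t tag;
};

/* Search the E lines of one set for a valid line with a matching tag.
 * Returns the index of the matching line, or -1 on a miss. */
int set_lookup(const struct cache_line *set, unsigned E, uint64_t tag)
{
    for (unsigned e = 0; e < E; e++) {
        if (set[e].valid && set[e].tag == tag)
            return (int)e;
    }
    return -1;
}
```

Note that a line with a matching tag but a cleared valid bit still counts as a miss.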

Multi-Level Cache
- In a cache, data and instructions can be kept separate or stored together (unified).

    Level                 Size          Speed   $/MB       Line size
    Registers             200 B         3 ns               8 B
    L1 d-cache, i-cache   8-64 KB       3 ns               32 B
    Unified L2 cache      1-4 MB SRAM   6 ns    $100/MB    32 B
    Main memory           128 MB DRAM   60 ns   $1.50/MB   8 KB
    Disk                  30 GB         8 ms    $0.05/MB

- Moving away from the processor, each level gets larger, slower and cheaper.

Intel Pentium Cache Hierarchy
- Registers.
- L1 data cache: 16 KB, 4-way associative, write-through, 32 B lines, 1 cycle latency.
- L1 instruction cache: 16 KB, 4-way, 32 B lines.
- Unified L2 cache: 128 KB to 2 MB, 4-way associative, write-back, write allocate, 32 B lines.
- Main memory: up to 4 GB.
[Figure: block diagram of the hierarchy with the processor chip boundary marked]

Cache Performance Metrics
- Miss rate: the fraction of memory references not found in the cache (misses/references). Typically 3-10% for L1 and < 1% for L2.
- Hit time: the time to deliver data from the cache to the processor, including the time to determine whether the data is in the cache. Typically 1 clock cycle for L1 and 3-8 clock cycles for L2.
- Miss penalty: the additional time required because of a miss. Typically 25-100 cycles for main memory.
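The slide lists the three metrics separately. A common way to combine them, which the slide itself does not state, is the average access time: hit time plus miss rate times miss penalty. The helper below is a sketch of that formula for a single cache level; its name is made up for this example.

```c
/* Average access time in cycles for a single cache level:
 * hit_time + miss_rate * miss_penalty. miss_rate is a fraction (0..1). */
double avg_access_cycles(double hit_time, double miss_rate,
                         double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

With a 1-cycle hit time, a 5% miss rate and a 100-cycle miss penalty, the average comes to about 6 cycles, which shows how strongly a small miss rate can dominate.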

Writing Cache-Friendly Code
- Good code makes repeated references to the same variables (temporal locality).
- Good code uses stride-1 reference patterns (spatial locality).
- Example: cold cache, 4-byte words, 4-word cache blocks.

    /* M and N are compile-time constants defined elsewhere. */

    /* Row-wise traversal: stride-1 accesses. */
    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

  Miss rate = 1/4 = 25%

    /* Column-wise traversal: stride-N accesses. */
    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

  Miss rate = 100%
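The two miss rates can be reproduced with a small simulation. The sketch below assumes a 1 KB direct-mapped cache with 16-byte blocks (four 4-byte words) and an array that starts at address 0; the cache geometry and the function name count_misses are assumptions for this illustration, not values from the slide.

```c
#include <string.h>

enum { BLOCK_BYTES = 16, NUM_SETS = 64 };  /* assumed 1 KB direct-mapped cache */

/* Count cold-cache misses while reading every element of an M x N
 * array of 4-byte ints stored row-major at address 0. If by_rows is
 * nonzero the array is walked row by row, otherwise column by column. */
long count_misses(int M, int N, int by_rows)
{
    long tags[NUM_SETS];
    int valid[NUM_SETS];
    long misses = 0;
    int outer = by_rows ? M : N;
    int inner = by_rows ? N : M;

    memset(tags, 0, sizeof tags);
    memset(valid, 0, sizeof valid);

    for (int p = 0; p < outer; p++) {
        for (int q = 0; q < inner; q++) {
            int i = by_rows ? p : q;            /* row index */
            int j = by_rows ? q : p;            /* column index */
            long addr = ((long)i * N + j) * 4;  /* byte address of a[i][j] */
            long block = addr / BLOCK_BYTES;
            int set = (int)(block % NUM_SETS);
            long tag = block / NUM_SETS;

            if (!valid[set] || tags[set] != tag) {
                misses++;                       /* load block into its set */
                valid[set] = 1;
                tags[set] = tag;
            }
        }
    }
    return misses;
}
```

For a 64 x 64 array (4096 reads), the row-wise walk misses 1024 times (25%) and the column-wise walk misses on all 4096 reads (100%) under these assumptions.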

The Memory Mountain
- Read throughput (read bandwidth): the number of bytes read from memory per second (MB/s).
- Memory mountain: measured read throughput as a function of spatial locality and temporal locality.
- It is a compact way to characterize the performance of a memory system.

Memory Mountain Test Function

    /* data[] is the global array defined in mountain.c; fcyc2() is a
       cycle-timing helper whose source is not shown in these slides. */

    /* The test function */
    void test(int elems, int stride)
    {
        int i, result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result; /* So compiler doesn't optimize away the loop */
    }

    /* Run test(elems, stride) and return read throughput (MB/s) */
    double run(int size, int stride, double Mhz)
    {
        double cycles;
        int elems = size / sizeof(int);

        test(elems, stride);                     /* warm up the cache */
        cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
        return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
    }

Memory Mountain Main Routine

    /* mountain.c - Generate the memory mountain. */
    #include <stdio.h>
    #include <stdlib.h>

    #define MINBYTES (1 << 10)  /* Working set size ranges from 1 KB */
    #define MAXBYTES (1 << 23)  /* ... up to 8 MB */
    #define MAXSTRIDE 16        /* Strides range from 1 to 16 */
    #define MAXELEMS (MAXBYTES / sizeof(int))

    int data[MAXELEMS];         /* The array we'll be traversing */

    /* init_data(), mhz() and run() are defined elsewhere. */
    int main()
    {
        int size;    /* Working set size (in bytes) */
        int stride;  /* Stride (in array elements) */
        double Mhz;  /* Clock frequency */

        init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
        Mhz = mhz();               /* Estimate the clock frequency */
        for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
            for (stride = 1; stride <= MAXSTRIDE; stride++)
                printf("%.1f\t", run(size, stride, Mhz));
            printf("\n");
        }
        exit(0);
    }

The Memory Mountain
[Figure: surface plot of read throughput (MB/s) against stride (words, s1 to s15) and working set size (bytes, 2k to 8m), with the L1, L2 and main memory regions labeled]
- Machine: Pentium III Xeon, 550 MHz, 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache.
- The slopes show spatial locality.
- The ridges show temporal locality.

Ridges of Temporal Locality
- A slice through the memory mountain with stride = 1.
- It shows the different read throughputs of the caches and of main memory.
[Figure: read throughput (MB/s) against working set size (bytes, 1k to 8m), divided into L1 cache, L2 cache and main memory regions]

A Slope of Spatial Locality
- A slice through the memory mountain with a working set size of 256 KB.
- It reveals the cache block size.
[Figure: read throughput (MB/s) against stride (words, s1 to s16), with the point of one access per cache line marked]

Matrix Multiplication Example
- Major cache effects to consider:
  - Total cache size: exploit temporal locality and keep the working set small (for example, by using blocking).
  - Block size: exploit spatial locality.
- Description:
  - Multiply N x N matrices.
  - O(N^3) total operations.
  - Accesses: N reads per source element; N values are summed per destination element, and the running sum can be held in a register.

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;              /* sum can be held in a register */
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Miss Rate Analysis for Matrix Multiplication
- Assumptions:
  - Line size = 32 B (big enough for four 64-bit words).
  - The matrix dimension N is very large, so 1/N is approximated as 0.0.
  - The cache is not big enough to hold multiple rows.
- Analysis method: look at the access pattern of the inner loop.
[Figure: matrices A (indexed by i and k), B (indexed by k and j) and C (indexed by i and j)]

Layout of C Arrays in Memory
- C arrays are allocated in row-major order: each row occupies contiguous memory locations.
- Stepping through the columns in one row:

      for (i = 0; i < N; i++)
          sum += a[0][i];

  - Accesses successive elements.
  - If the block size B > 4 bytes, spatial locality is exploited: miss rate = 4 bytes / B.
- Stepping through the rows in one column:

      for (i = 0; i < n; i++)
          sum += a[i][0];

  - Accesses distant elements.
  - No spatial locality: miss rate = 1 (i.e. 100%).
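Both cases on this slide are instances of one rule of thumb: for a scan with a stride of k elements, the miss rate is roughly min(1, k x element size / B). The helper below is a sketch of that estimate; it assumes the data is too large to stay cached between passes, and the function name is made up for this example.

```c
/* Estimated miss rate for a scan that touches every stride-th element,
 * with elem_bytes per element and block_bytes per cache block:
 * min(1, stride * elem_bytes / block_bytes). */
double stride_miss_rate(int elem_bytes, int stride, int block_bytes)
{
    double rate = (double)elem_bytes * stride / block_bytes;
    return rate < 1.0 ? rate : 1.0;
}
```

With 4-byte ints and 16-byte blocks, a stride-1 row scan gives 0.25, while a column scan over a wide array (stride of one full row) saturates at 1.0.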

Matrix Multiplication (ijk)

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

- Inner loop: A (i,*) is accessed row-wise, B (*,j) column-wise, C (i,j) is fixed.
- Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0.

Matrix Multiplication (jik)

    /* jik */
    for (j = 0; j < n; j++) {
        for (i = 0; i < n; i++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

- Inner loop: A (i,*) is accessed row-wise, B (*,j) column-wise, C (i,j) is fixed.
- Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0.

Matrix Multiplication (kij)

    /* kij */
    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

- Inner loop: A (i,k) is fixed, B (k,*) is accessed row-wise, C (i,*) row-wise.
- Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25.

Matrix Multiplication (ikj)

    /* ikj */
    for (i = 0; i < n; i++) {
        for (k = 0; k < n; k++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

- Inner loop: A (i,k) is fixed, B (k,*) is accessed row-wise, C (i,*) row-wise.
- Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25.

Matrix Multiplication (jki)

    /* jki */
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

- Inner loop: A (*,k) is accessed column-wise, B (k,j) is fixed, C (*,j) is accessed column-wise.
- Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0.

Matrix Multiplication (kji)

    /* kji */
    for (k = 0; k < n; k++) {
        for (j = 0; j < n; j++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

- Inner loop: A (*,k) is accessed column-wise, B (k,j) is fixed, C (*,j) is accessed column-wise.
- Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0.

Summary of Matrix Multiplication
- ijk (& jik): 2 loads, 0 stores; misses/iteration = 1.25

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

- kij (& ikj): 2 loads, 1 store; misses/iteration = 0.5

    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++) {
            r = a[i][k];
            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
    }

- jki (& kji): 2 loads, 1 store; misses/iteration = 2.0

    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            r = b[k][j];
            for (i = 0; i < n; i++)
                c[i][j] += a[i][k] * r;
        }
    }
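The three totals follow mechanically from the access pattern of each matrix in the inner loop. The sketch below encodes that bookkeeping under the slide assumptions (very large n, cache too small to hold multiple rows); the enum and function names are hypothetical.

```c
enum access { FIXED, ROW_WISE, COL_WISE };

/* Misses per inner-loop iteration contributed by one matrix. */
double misses_for(enum access pattern, int elem_bytes, int line_bytes)
{
    switch (pattern) {
    case ROW_WISE: return (double)elem_bytes / line_bytes; /* one miss per line */
    case COL_WISE: return 1.0;                             /* every access misses */
    default:       return 0.0;                             /* FIXED: no new misses */
    }
}

/* Total misses per inner-loop iteration for a loop order, given the
 * inner-loop access pattern of A, B and C. */
double misses_per_iter(enum access a, enum access b, enum access c,
                       int elem_bytes, int line_bytes)
{
    return misses_for(a, elem_bytes, line_bytes)
         + misses_for(b, elem_bytes, line_bytes)
         + misses_for(c, elem_bytes, line_bytes);
}
```

With 8-byte elements and 32-byte lines, the patterns (row, column, fixed), (fixed, row, row) and (column, fixed, column) give 1.25, 0.5 and 2.0, matching the ijk, kij and jki rows of the summary.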

Pentium Matrix Multiplication Performance
- Miss rates are not always a good predictor of performance.
- Code scheduling matters, too.
[Figure: cycles/iteration against array size (n) for the kji, jki, kij, ikj, jik and ijk versions]

Improving Temporal Locality
- Temporal locality can be improved by blocking.
- Example: blocked matrix multiplication.
  - "Block" here does not mean a cache block. It means a sub-block within the matrix.
  - Example: N = 8; sub-block size = 4.

      [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
      [ A21 A22 ] x [ B21 B22 ] = [ C21 C22 ]

- Key idea: sub-blocks (e.g., Axy) can be treated just like scalars.

      C11 = A11 B11 + A12 B21      C12 = A11 B12 + A12 B22
      C21 = A21 B11 + A22 B21      C22 = A21 B12 + A22 B22

Blocked Matrix Multiplication

    /* min(x, y) is assumed to be a helper macro defined elsewhere. */
    for (jj = 0; jj < n; jj += bsize) {
        /* Clear the current column strip of C. */
        for (i = 0; i < n; i++)
            for (j = jj; j < min(jj+bsize, n); j++)
                c[i][j] = 0.0;
        for (kk = 0; kk < n; kk += bsize) {
            for (i = 0; i < n; i++) {
                for (j = jj; j < min(jj+bsize, n); j++) {
                    sum = 0.0;
                    for (k = kk; k < min(kk+bsize, n); k++) {
                        sum += a[i][k] * b[k][j];
                    }
                    c[i][j] += sum;
                }
            }
        }
    }

Blocked Matrix Multiplication Analysis
- The innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates the result into a 1 x bsize sliver of C.
- The loop over i steps through n row slivers of A and C, using the same block of B.

    /* Innermost loop pair */
    for (i = 0; i < n; i++) {
        for (j = jj; j < min(jj+bsize, n); j++) {
            sum = 0.0;
            for (k = kk; k < min(kk+bsize, n); k++) {
                sum += a[i][k] * b[k][j];
            }
            c[i][j] += sum;
        }
    }

[Figure: matrices A, B and C with kk, jj and i marked. The row sliver of A is accessed bsize times; the block of B is reused n times in succession; successive elements of the C sliver are updated.]
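To try the blocked loop nest outside the slides, it can be wrapped in a function and checked against the plain ijk version. The sketch below is one way to do that; it assumes n x n matrices stored row-major in flat arrays, and the names matmul_ijk, matmul_blocked and min_int are chosen for this example rather than taken from the course code.

```c
static int min_int(int x, int y) { return x < y ? x : y; }

/* Reference version: c = a * b using the ijk loop order. */
void matmul_ijk(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Blocked version with the same loop structure as the slide:
 * for each column strip jj of C and each block kk of B, every row i
 * accumulates a partial dot product of at most bsize terms. */
void matmul_blocked(int n, int bsize, const double *a, const double *b,
                    double *c)
{
    for (int jj = 0; jj < n; jj += bsize) {
        for (int i = 0; i < n; i++)
            for (int j = jj; j < min_int(jj + bsize, n); j++)
                c[i * n + j] = 0.0;
        for (int kk = 0; kk < n; kk += bsize)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < min_int(jj + bsize, n); j++) {
                    double sum = 0.0;
                    for (int k = kk; k < min_int(kk + bsize, n); k++)
                        sum += a[i * n + k] * b[k * n + j];
                    c[i * n + j] += sum;
                }
    }
}
```

Both functions compute the same product; with general floating-point inputs the results may differ in the last bits because the additions are grouped differently, so a comparison should allow a small tolerance unless the inputs are small integers.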

Blocking Performance on the Pentium
- Performance of blocked matrix multiplication on the Pentium.
- Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik).
- Performance is relatively insensitive to array size.
[Figure: cycles/iteration against array size (n) for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25) and bikj (bsize = 25)]

Conclusions
- The programmer can optimize for cache performance through:
  - how data structures are organized;
  - how data is accessed (nested loop structure; blocking is a general technique).
- All systems favor cache-friendly code.
  - Getting the absolute optimum performance is very platform specific (cache sizes, line sizes, associativities, etc.).
  - Most of the benefit can be obtained with generic code: keep the working set small (temporal locality) and use small strides (spatial locality).