The Need for High-Performance Computers: Parallel Processing
Kudang B. Seminar

Application areas:
- Weather forecasting
- Aerodynamics
- Artificial intelligence: robotics
- Genetic engineering
The applications above involve intensive computation and require high processing power.

Example 1: Weather Prediction
- Area: 3000 * 3000 * 11 cubic miles
- Segments of 0.1 * 0.1 * 0.1 cubic mile: ~10^11 segments
- Two-day prediction in half-hour periods: ~100 periods
- Computation per segment: temperature, pressure, humidity, wind speed, wind direction
- Assume ~100 FLOPs per segment

Performance: Weather Prediction
- Computational requirement: ~10^15 operations
- Serial supercomputer: 10^9 instructions/sec
- Total serial time: 10^6 sec = ~280 hours
- Not too good for a 48-hour weather prediction
- (See the worked calculation below.)
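A minimal sketch of the arithmetic behind the serial estimate above. The segment count, periods, FLOPs per segment, and instruction rate are the figures from the slide; the rest is illustrative.

```c
#include <stdio.h>

int main(void) {
    double segments   = 1e11;  /* ~10^11 segments of 0.1 x 0.1 x 0.1 cubic mile */
    double periods    = 100;   /* two days in half-hour steps */
    double flops_seg  = 100;   /* ~100 FLOPs per segment per period */
    double serial_ips = 1e9;   /* serial supercomputer: 10^9 instructions/sec */

    double total_ops   = segments * periods * flops_seg;  /* ~10^15 operations */
    double serial_secs = total_ops / serial_ips;           /* ~10^6 seconds */

    printf("total ops   : %.1e\n", total_ops);
    printf("serial time : %.1e sec (= %.0f hours)\n", serial_secs, serial_secs / 3600.0);
    return 0;
}
```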
Parallel Weather Prediction
- 1K workstations, grid connected
- 10^8 segment computations per processor
- 10^8 instructions per second
- 100 instructions per segment computation
- 100 time steps: 10^4 seconds = ~3 hours
- Much more acceptable
- Assumption: communication is not a problem here
- More workstations -> finer grid -> better accuracy

Example 2: N-Body Problem
- Astronomy: bodies in space attract each other (gravitational force, Newton's law)
- O(n*n) calculations per snapshot
- Galaxy: ~10^11 bodies -> ~10^22 calculations
- Calculation: 1 microsecond each
- Snapshot: 10^16 secs = ~10^11 days = ~3*10^8 years
- Is parallelism going to help us? NO
- What does help? A better algorithm: Barnes-Hut
  - Divides the space into a quad tree
  - Treats far-away quads as one body
- (A sketch of the naive O(n*n) force loop appears at the end of this page.)

Other Challenging Applications
- Satellite data acquisition: billions of bits/sec
- Satellite data processing: pollution levels, remote sensing of materials
- Image recognition
- Discrete optimization problems: planning, scheduling, VLSI design
- Material modeling
- Nuclear weapons modeling (ASCI)
- Airplane/satellite/vehicle design

Application-Specific Architectures
- Mapping an algorithm directly onto hardware
- ASICs: Application-Specific Integrated Circuits
- Levels of specificity: full-custom ASICs, standard-cell ASICs, field-programmable gate arrays
- Computational models: dataflow graphs, systolic arrays
- Orders of magnitude better performance
- Orders of magnitude lower power
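To make the O(n*n) cost of the naive N-body computation above concrete, here is a minimal C sketch of one snapshot's pairwise force accumulation. The data layout and function name are illustrative, not from the slides.

```c
#include <math.h>

#define G 6.674e-11  /* gravitational constant, SI units */

typedef struct { double x, y, z, mass; } Body;

/* Accumulate, for every body, the net gravitational force from all others.
 * Two nested loops over n bodies: O(n*n) work per snapshot, which is why a
 * galaxy with ~10^11 bodies needs ~10^22 calculations. */
void compute_forces(const Body *b, double (*force)[3], int n) {
    for (int i = 0; i < n; i++) {
        force[i][0] = force[i][1] = force[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = b[j].x - b[i].x;
            double dy = b[j].y - b[i].y;
            double dz = b[j].z - b[i].z;
            double r2 = dx*dx + dy*dy + dz*dz;
            double r  = sqrt(r2);
            double f  = G * b[i].mass * b[j].mass / r2;  /* Newton's law */
            force[i][0] += f * dx / r;
            force[i][1] += f * dy / r;
            force[i][2] += f * dz / r;
        }
    }
}
```

Barnes-Hut replaces the inner loop over all j with a traversal of a spatial tree, treating sufficiently distant groups of bodies as a single aggregate body, which brings the cost down to roughly O(n log n).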
ASICs (cont.)
How much faster than general purpose?
- Example: 1D 1024-point FFT
  - General-purpose machine (G4): 25 microseconds
  - ASIC device (MIT Lincoln Labs): 32 nanoseconds
  - The ASIC device uses 20 milliwatts (~100x less power)
- Future designs: 2 tera-ops in a small (less than a cubic foot) device
- Target applications: FFT, Finite Impulse Response (FIR) filters, matrix multiply, QR decomposition

A Concrete Example
- A 24-hour weather forecast for the UK involves roughly 10^12 operations. This takes about 2.7 hours on a Cray-1 (capable of 10^8 operations per second). How many operations would a weekly, monthly, or yearly forecast require?
- According to Einstein, the speed of light is 3 x 10^8 m/s. Consider two electronic devices, each capable of 10^12 operations/second, separated by 0.5 mm. A signal then takes longer to travel between the two devices than either device needs to execute one operation (10^-12 seconds). The limiting factor is therefore the speed of light.
- SOLUTION: exploit parallelism. (See the sketch of this calculation at the end of this page.)

Motivation for Parallel Computing
- Parallel computing is cost effective
  - Off-the-shelf, commodity processors are very fast
  - Memory is very cheap
  - Building a processor that is a small factor faster costs an order of magnitude more
- NoW is the time!
  - Cheapest way to get more performance: multiprocessor
  - NoW: Networks of Workstations
  - A workstation can be an SMP
  - SMP: Symmetric MultiProcessor (shared memory, bus)

Wile E. Coyote's Parallel Computer
- Get a lot of the fastest processors
- Get a lot of memory per processor
- Get the fastest network
- Hook it all together
- And then what???
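A minimal sketch of the speed-of-light argument from "A Concrete Example" above. The distance, operation rate, and speed of light are the values from the slide.

```c
#include <stdio.h>

int main(void) {
    double c        = 3e8;     /* speed of light, m/s */
    double distance = 0.5e-3;  /* 0.5 mm between the two devices, in meters */
    double op_rate  = 1e12;    /* each device executes 10^12 operations/sec */

    double signal_delay = distance / c;   /* time for a signal to cross the gap */
    double op_time      = 1.0 / op_rate;  /* time for one operation: 10^-12 sec */

    printf("signal delay : %.2e s\n", signal_delay);       /* ~1.67e-12 s */
    printf("op time      : %.2e s\n", op_time);            /* 1.00e-12 s */
    printf("delay / op   : %.2f\n", signal_delay / op_time);
    return 0;
}
```

The crossing delay exceeds the time for one operation, so making a single device faster eventually stops paying off; hence the push toward parallelism.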
Now you need to program it!

Problem with the Wile E. Coyote Architecture
- Parallel programming introduces:
  - Task partitioning, task scheduling
  - Data partitioning
  - Synchronization
  - Load balancing
  - Latency issues: hiding, tolerance

Von Neumann Machines Are Not Built for Parallelism
- To get high speed, processors have lots of state: cache, stack, global memory
- To tolerate latency, we need a fast context switch. WHY?
- No free lunch: can't have both
  - Certainly not if the processor was not designed for both
- Memory wall: memory gets slower and slower in terms of the number of cycles it takes to access
  - The memory hierarchy gets more and more complex
  - Memory accesses block: no split-phase memory access

Sequential vs Parallel Algorithms
- Efficient parallel algorithms
  - Maximize parallelism
  - Minimize synchronization and remote accesses
  - Efficiency is architecture dependent
- Efficient sequential algorithms
  - Minimize time and space
  - Efficiency is portable: an efficient C program on a Pentium ~ an efficient C program on an Alpha

Speedup
- Ideal: n processors -> n-fold speedup
- The ideal is not always possible. WHY?
  - Tasks are data dependent
  - Not all processors are always busy
  - Remote data
- Superlinear speedup: > n speedup
  - Nonsense! Because we can execute the faster parallel program sequentially
  - No nonsense!! Because parallel computers do not just have more processors, they also have more caches
- (See the speedup sketch below.)
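A minimal sketch of how speedup and efficiency are usually computed from measured times; the timing values below are made up purely for illustration.

```c
#include <stdio.h>

int main(void) {
    int    n          = 16;     /* number of processors (illustrative) */
    double t_serial   = 100.0;  /* best sequential time, seconds (made up) */
    double t_parallel = 8.0;    /* measured time on n processors (made up) */

    double speedup    = t_serial / t_parallel;  /* ideal would be n = 16 */
    double efficiency = speedup / n;            /* ideal would be 1.0 */

    printf("speedup    = %.2f (ideal %d)\n", speedup, n);
    printf("efficiency = %.2f\n", efficiency);
    /* A measured speedup greater than n (superlinear) usually signals a cache
     * effect: the per-processor working set now fits in cache. */
    return 0;
}
```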
Parallel Programming Paradigms
- Super-compilers
  - 20 years of parallelizing compilers, and what do we get? ...not much: we understand loops (a bit)
- Multithreading
  - Pthreads, Solaris threads; not much difference
- Message passing
  - MPI rules ...well, there is also PVM (Parallel Virtual Machine)
  - (A minimal MPI sketch appears at the end of this page.)
- Data-parallel programming
  - Niche work, but important

Implicit vs Explicit Parallelism
- Implicit: super-compilers
  - Extract parallelism from a sequential program
  - The general case is too hard: pointers, aliases, recursion, separate compilation, dynamic dependence distances in array references
- Explicit parallelism: threads or messages
  - Complicates programming: creation, allocation, and scheduling of processes; data partitioning; synchronization (locks, messages)

Sequential & Parallel Processing: Classification of Parallel Machines
Models of Computation (Flynn 1966)
1. Single Instruction Stream, Single Data Stream: SISD
2. Multiple Instruction Stream, Single Data Stream: MISD
3. Single Instruction Stream, Multiple Data Stream: SIMD
4. Multiple Instruction Stream, Multiple Data Stream: MIMD
5. Single Program, Multiple Data: SPMD
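As a taste of the message-passing style mentioned above, here is a minimal MPI "hello" sketch in C. The compile and launch commands vary by MPI installation; `mpicc` and `mpirun` are typical but not guaranteed.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id: 0..size-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                        /* shut down the MPI runtime */
    return 0;
}
```

Typically built with something like `mpicc hello.c -o hello` and launched with `mpirun -np 4 ./hello`, which starts four copies of the same program.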
SISD Computers
- The von Neumann architecture computer
- For the operation a_1 + a_2 + a_3 + ... + a_n, the processor needs n memory accesses and n-1 addition operations, so the time complexity of the operation is O(n)
- (See the sketch at the end of this page.)

MISD Computers
- N processors, each with its own control unit, sharing a common memory (shared memory)

SIMD Computers
- N processors operate under the control of a single instruction stream issued by a central control unit
- Parallelism is obtained by having all processors perform the same operation/task simultaneously, each on its own data
- The processors operate synchronously, and a global clock is used to ensure lockstep operation
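A minimal sketch of the sequential summation the SISD slide above reasons about: one processor, n memory accesses, n-1 additions, O(n) time. The function name is illustrative and assumes n >= 1.

```c
/* Sum a[0] + a[1] + ... + a[n-1] on a single (SISD) processor. */
double sequential_sum(const double *a, int n) {
    double sum = a[0];           /* 1st memory access */
    for (int i = 1; i < n; i++)  /* n-1 further accesses ... */
        sum += a[i];             /* ... and n-1 additions: O(n) overall */
    return sum;
}
```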
MIMD Computers
(Figure: the potential of the four classes of computers)

SPMD Computers
- The same program is executed on the processors of a MIMD computer
- SPMD is not a hardware paradigm; it is the software equivalent of SIMD, but asynchronous
- Consider the instruction IF X = 0 THEN S1 ELSE S2
  - Assume X = 0 on processor P1 and X != 0 on processor P2
  - Processor P1 executes S1 in parallel with processor P2 executing S2 (this cannot happen on SIMD)
- (See the SPMD sketch below.)
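A minimal SPMD sketch of the IF X = 0 THEN S1 ELSE S2 example above, written with MPI. The rank is used here as a stand-in for each processor's local value of X; the variable names are illustrative.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process runs this same program (SPMD), but each holds its own X,
     * so different processes can take different branches at the same time,
     * which lockstep SIMD cannot do. */
    int x = rank;  /* e.g. X = 0 on P1 (rank 0), X != 0 on the other ranks */
    if (x == 0)
        printf("process %d: executing S1\n", rank);  /* S1 */
    else
        printf("process %d: executing S2\n", rank);  /* S2 */

    MPI_Finalize();
    return 0;
}
```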