



hyPACK-2013 Mode-4: GPU Computing with CUDA enabled NVIDIA GPUs

The computational power of GPUs has attracted wide attention in the scientific community, since GPUs offer unprecedented performance for data-intensive applications. The use of the Graphics Processing Unit (GPU) to accelerate non-graphics computations has drawn much attention because the computational power of GPUs has exceeded that of PC-based CPUs by more than an order of magnitude while remaining available at a comparable price. CUDA 5.0 is used for program development in the laboratory sessions, and tuning and optimisation techniques are employed to extract the performance of application kernels.




CUDA - NVIDIA GPU Programming Overview

NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on the company's powerful GPUs. CUDA is a fundamentally new computing architecture that enables the GPU to solve complex computational problems.

CUDA technology gives computationally intensive applications access to the processing power of NVIDIA graphics processing units (GPUs) through a new programming interface. The game and graphics community has been using NVIDIA's GPUs and graphics cards (the GeForce, Quadro, Tesla and Fermi brand products) for a long time.

CUDA requires programmers to write special code for parallel processing, but it does not require them to explicitly manage threads, which simplifies the programming model. CUDA includes C/C++ software development tools, function libraries and a hardware abstraction mechanism that hides the GPU hardware from developers. Selected scientific and engineering applications that are data-intensive or embarrassingly parallel, as well as consumer-market applications (gaming, video), may require only single-precision floating-point operations; CUDA provides a solution for such applications, and NVIDIA's newer GPUs, which support double-precision floating-point operations, can address a broader class of applications. The NVIDIA Tesla cards are becoming popular in high-performance computing applications.



CUDA Programming Model

The CUDA programming model manages threads automatically, and it differs significantly from single-threaded CPU code and, to some extent, even from other parallel code. Before NVIDIA's CUDA became available, some users in the parallel-processing community wrote codes for GPUs through graphics APIs. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks. The GPU is viewed as a compute device capable of executing a very high number of threads in parallel. It operates as a coprocessor to the main CPU, called the host. Data-parallel, compute-intensive portions of applications running on the host are off-loaded to the device as a function (kernel) that is executed on the device by many different threads. Both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively. One can copy data from one DRAM to the other through optimized API calls that utilize the device's high-performance Direct Memory Access (DMA) engines.
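The host/device flow described above can be illustrated with a minimal sketch; the kernel, array size and launch configuration below are illustrative assumptions, not part of the workshop codes:

     #include <cuda_runtime.h>
     #include <stdio.h>

     /* Kernel executed on the device by many threads: each thread scales one element. */
     __global__ void scale(float *d_a, float alpha, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
         if (i < n)
             d_a[i] *= alpha;
     }

     int main(void)
     {
         const int n = 1 << 20;
         size_t bytes = n * sizeof(float);
         float *h_a = (float *)malloc(bytes);
         for (int i = 0; i < n; i++) h_a[i] = 1.0f;

         float *d_a;
         cudaMalloc((void **)&d_a, bytes);                     /* allocate device memory  */
         cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  /* host -> device via DMA  */

         scale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);        /* launch kernel on device */

         cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  /* device -> host          */
         printf("h_a[0] = %f\n", h_a[0]);

         cudaFree(d_a);
         free(h_a);
         return 0;
     }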

The CUDA model is a highly parallel GPGPU model. The approach is to divide the data set into smaller chunks stored in on-chip memory and then allow multiple thread processors to share each chunk. Storing the data locally reduces the need to access off-chip memory, thereby improving performance. Designing scientific-computing applications that avoid off-chip memory access may require rewriting the application or redesigning the algorithm. Also, the overheads involved in loading the required off-chip data into local memory may affect performance. CUDA handles this intelligently: an off-chip memory access usually does not stall a thread processor, because another thread is ready to execute in its place.

In CUDA, a group of threads works together in round-robin fashion, ensuring that each thread gets execution time without delaying other threads, thereby reducing thread-scheduling overheads. How well the wait for remote access is hidden strongly factors into a CUDA program's efficiency and scaling. A thread block is a batch of threads that can cooperate by efficiently sharing data through fast shared memory and by synchronizing their execution to coordinate memory accesses, using synchronization points specified in the kernel. Each thread is identified by its thread ID, which is the thread number within the block. An application can also specify a block as a three-dimensional array and identify each thread using a 3-component index.
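A minimal sketch of block-level cooperation (the kernel name and the block size of 256 are illustrative assumptions): each thread of a block stages one element into shared memory, the block synchronizes, and each thread then reads a neighbouring element from the on-chip tile.

     __global__ void shift_left(const float *in, float *out, int n)
     {
         __shared__ float tile[256];              /* fast on-chip shared memory; assumes blockDim.x == 256 */
         int i = blockIdx.x * blockDim.x + threadIdx.x;

         if (i < n)
             tile[threadIdx.x] = in[i];           /* each thread loads one element   */
         __syncthreads();                         /* synchronization point in kernel */

         /* read a neighbour's element from shared memory (stay inside the block) */
         if (i < n) {
             int j = (threadIdx.x + 1 < blockDim.x) ? threadIdx.x + 1 : threadIdx.x;
             out[i] = tile[j];
         }
     }

Such a kernel would be launched as shift_left<<<(n + 255) / 256, 256>>>(d_in, d_out, n), so that the shared tile matches the block size.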

The CUDA Toolkit is a complete software development solution for programming CUDA enabled GPUs. The Toolkit includes standard FFT and BLAS libraries, a C-compiler for the NVIDIA GPU and a runtime driver. CUDA technology is currently supported on Linux and Microsoft Windows XP operating systems.



CUDA Toolkit 4.1 for Applications

CUDA Multi-GPU Programming : The CUDA programming model provides two basic approaches for executing CUDA kernels on multiple GPUs (CUDA "devices") concurrently from a single host application:

  • Use one host thread per device, since any given host thread can call cudaSetDevice() at most one time.
  • Use the push/pop context functions provided by the CUDA Driver API.

For applications that require tight coupling among the various CUDA devices within a system, these approaches may not be sufficient, because the devices must synchronize or communicate with each other. The CUDA Runtime now provides features by which a single host thread can easily launch work onto any device it needs. To accomplish this, a host thread can call cudaSetDevice() at any time to change the currently active device, so a host thread can now control more than one device. The CUDA Driver API (version 4.1) provides a way to access multiple devices from within a single host thread, namely cuCtxPushCurrent() and cuCtxPopCurrent(). For convenience, CUDA application developers can also use a set/get context-management interface: cuCtxSetCurrent() and cuCtxGetCurrent() have been added in version 4.1 of the CUDA Driver API in addition to the existing cuCtxPushCurrent() and cuCtxPopCurrent() functions.

Programming a multi-GPU application is not very different from programming an application to utilize multiple cores or sockets, because CUDA is completely orthogonal to CPU thread management and message-passing APIs. What matters most is selecting the correct GPU, which in most cases is a free GPU (one without a context). It is also straightforward to identify the compute-intensive portion of an existing multi-threaded CPU code and port it to the GPU while leaving the inter-CPU-thread communication code unchanged.

In order to issue work to a GPU, a context is established between a CPU thread (or group of threads) and the GPU. Only one context can be active on a GPU at any particular instant. Similarly, a CPU thread can have one active context at a time. A context is established during the program's first call to a function that changes state (such as cudaMalloc(), etc.), so one can force the creation of a context by calling cudaFree(0). Note that a context is created on GPU 0 by default, unless another GPU is selected explicitly prior to context creation with a cudaSetDevice() call. The context is destroyed either with a cudaDeviceReset() call or when the controlling CPU process exits.
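A minimal sketch of the device-selection and context-creation sequence described above (the device index and the surrounding error handling are illustrative assumptions):

     int dev = 1;                    /* any valid device index; GPU 0 is the default       */
     cudaSetDevice(dev);             /* select the device before the context exists        */
     cudaFree(0);                    /* first state-changing call: forces context creation */
     /* ... cudaMalloc()/kernel launches now run on device 'dev' ...                       */
     cudaDeviceReset();              /* destroys the context on the current device         */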

MPI, OpenMP, Pthreads on Host CPU (Multi-Core) & Multi-GPU : In order to issue work to p GPUs concurrently, a program can either use p CPU threads, each with its own context, or it can use one CPU thread that swaps among several contexts, or some combination thereof. CPU threads can be lightweight (pthreads, OpenMP, etc.) or heavyweight (MPI). Note that any CPU multi-threading or message-passing API or library can be used, as CPU thread management is completely orthogonal to CUDA. For example, one can add GPU processing to an existing MPI application by porting the compute-intensive portions of the code without changing the communication structure. For synchronization across computations on different GPUs, communication through the host CPU or through GPUDirect is required.

Even though a GPU can execute calls from one context at a time, it can belong to multiple contexts. For example, it is possible for several CPU threads to establish separate contexts with the same GPU (though multiple CPU threads within the same process accessing the same GPU would normally share the same context by default). The GPU driver manages GPU switching between the contexts, as well as partitioning memory among the contexts (GPU memory allocated in one context cannot be accessed from another context).

In many applications, the algorithm is designed so that each CPU thread (Pthreads, OpenMP, MPI process) controls a different GPU. Achieving this is straightforward if a program spawns as many lightweight threads as there are GPUs: one can derive the GPU index from the thread ID. For example, the OpenMP thread ID can be used directly to select a GPU. The MPI rank can be used to choose a GPU reliably as long as all MPI processes are launched on a single host node that has the GPU devices and a host configuration of the CUDA programming environment.
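As an illustration of the thread-ID-to-GPU mapping, the following minimal OpenMP sketch gives each OpenMP thread its own GPU; the kernel, buffer size and launch configuration are illustrative assumptions:

     #include <omp.h>
     #include <cuda_runtime.h>

     __global__ void work(float *d_buf)           /* trivial placeholder kernel        */
     {
         d_buf[threadIdx.x] = (float)threadIdx.x;
     }

     void run_on_all_gpus(void)
     {
         int ngpus = 0;
         cudaGetDeviceCount(&ngpus);

         #pragma omp parallel num_threads(ngpus)
         {
             int tid = omp_get_thread_num();
             cudaSetDevice(tid);                  /* OpenMP thread ID selects the GPU  */

             float *d_buf;
             cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
             work<<<4, 256>>>(d_buf);             /* each thread drives its own device */
             cudaDeviceSynchronize();
             cudaFree(d_buf);
         }
     }

Compiled with nvcc and OpenMP enabled (e.g. -Xcompiler -fopenmp), each thread establishes its own context on the device selected by its thread ID.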




Unified Virtual Addressing and GPUDirect 2.0 : CUDA Toolkit 5.0 eases programming in multi-GPU environments for NVIDIA Tesla T20-series (Fermi & Kepler) GPUs running in 64-bit mode on Linux. Unified Virtual Addressing (UVA) allows the system memory and the one or more device memories in a system to share a single virtual address space. This allows the CUDA Driver to determine, by inspection, the physical memory space to which a particular pointer refers, which simplifies the APIs of functions such as cudaMemcpy(), since the application no longer needs to keep track of which pointers refer to which memory.
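With UVA in effect, the copy direction can be left to the driver. A minimal sketch (the buffer size is an illustrative assumption):

     float *h_buf = NULL, *d_buf = NULL;
     size_t bytes = 1024 * sizeof(float);

     cudaMallocHost((void **)&h_buf, bytes);        /* pinned host memory, in the UVA space */
     cudaMalloc((void **)&d_buf, bytes);

     /* cudaMemcpyDefault: the driver infers the direction from the pointers */
     cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);
     cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDefault);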


GPUDirect 2.0 : Built on top of UVA, GPUDirect v2.0 provides direct peer-to-peer communication among the multiple devices in a system and native MPI transfers directly from device memory.
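A minimal sketch of enabling peer-to-peer access between two devices, assuming d_buf0 and d_buf1 were previously allocated with cudaMalloc() on devices 0 and 1 respectively and each holds 'bytes' bytes:

     int canAccess = 0;
     cudaDeviceCanAccessPeer(&canAccess, 0, 1);     /* can device 0 access device 1?   */

     if (canAccess) {
         cudaSetDevice(0);
         cudaDeviceEnablePeerAccess(1, 0);          /* enable P2P from device 0 to 1   */
     }

     /* direct device-to-device copy; with UVA, cudaMemcpy(..., cudaMemcpyDefault)
        on the two device pointers would also work */
     cudaMemcpyPeer(d_buf0, 0, d_buf1, 1, bytes);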

Multi-Threaded Programming : The multi-device features above have several important ramifications for multi-threaded processes; some of these are given below. For more detail, refer to CUDA Toolkit 5.0 for Applications.

  • Host threads can now share device memory allocations, streams, events, or any other per-context objects (as seen above).

  • Concurrent kernel execution on devices of compute capability 2.x is now possible across host threads, rather than just within a single host thread. Note that this requires the use of separate streams; unless streams are specified, the kernels will be executed sequentially on the device in the order they were launched. In all cases, kernel launch via the <<<>>> notation is a thread-safe operation.

  • cudaGetLastError() is per-host-thread: it returns the last error returned by an API call in that host thread, even if other host threads are concurrently accessing the same device.


CUDA Driver API :

Version 4.1 adds a feature by which multiple host threads can set a particular context current simultaneously, using either cuCtxSetCurrent() or cuCtxPushCurrent(). For more information, refer to CUDA Toolkit 5.0 for Applications. This has several important ramifications for multi-threaded processes:

  • Host threads can now share device memory allocations, streams, events, or any other per-context objects (as seen above).

  • Concurrent kernel execution on devices of compute capability 2.x is now possible across host threads, rather than just within a single host thread. Note that this requires the use of separate streams; unless streams are specified, the kernels will be executed sequentially on the device in the order they were launched.




NVIDIA CUDA TOOLKIT Libraries

  • The CUBLAS library now supports a new API that is thread-safe and allows the application to more easily take advantage of parallelism using streams, especially for functions with scalar return parameters (a minimal sketch of the handle-based API is given after this list). This new API allows CUBLAS to work cleanly with applications using the new multi-threading features of CUDA Runtime 4.1. The legacy CUBLAS API is still supported, but it is not thread-safe and does not offer as many opportunities for parallelism with streams as the new API.

  • The CURAND library now supports double precision Sobol, scrambled Sobol, log-normal distributions, and a faster setup technique for XORWOW.

  • The CUFFT and CUBLAS library APIs now include functions that will report the library's version number.

  • The CUSPARSE library now provides a solver for triangular sparse linear systems via the cusparse*csrsv_analysis() and cusparse*csrsv_solve() API functions.

  • The Thrust template library and the NPP image processing library are now bundled with the CUDA Toolkit, with no additional download required.

  • Some API functions in the NPP library were changed to pass results via device pointer instead of via host pointer for consistency with all of the rest of the NPP API.
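A minimal sketch of the handle-based (v2) CUBLAS API mentioned above, for a single SGEMM call placed in its own stream; the routine name and matrix sizes are illustrative assumptions:

     #include <cublas_v2.h>
     #include <cuda_runtime.h>

     void sgemm_example(int n, const float *d_A, const float *d_B, float *d_C)
     {
         cublasHandle_t handle;
         cublasCreate(&handle);                    /* thread-safe, per-thread handle   */

         cudaStream_t stream;
         cudaStreamCreate(&stream);
         cublasSetStream(handle, stream);          /* run the call in its own stream   */

         const float alpha = 1.0f, beta = 0.0f;    /* scalars passed by pointer in v2  */
         cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);

         cudaStreamDestroy(stream);
         cublasDestroy(handle);
     }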




CUDA OpenACC

  • OpenACC is a new open parallel programming standard designed to enable the millions of scientific and technical Fortran and C programmers to easily take advantage of the transformative power of heterogeneous CPU/GPU computing systems. OpenACC provides compiler directives that identify the areas of code to accelerate, without requiring programmers to modify or adapt the underlying code itself. The high degree of data parallelism present in the code is exposed to the compiler, and the directives allow the compiler to do the detailed work of mapping the computation onto the accelerator.

    Directives provide a common code base that is multi-platform and multi-vendor compatible, offering an ideal way to preserve investment in legacy applications by enabling an easy migration path to accelerated computing. NVIDIA is positioning its OpenACC as a kind of high level gateway to its lower level CUDA GPU programming language. Based on the OpenACC standard, GPU directives are the easy, proven way to accelerate your scientific or industrial code.

    With GPU directives, users can accelerate their code simply by inserting compiler hints into the code, and the compiler will automatically map the compute-intensive portions onto the GPU. The OpenACC Application Program Interface (OpenACC API), which provides portability across operating systems, host CPUs and GPU accelerators, and the CUDA APIs will be used in this Coding Competition. PGI Accelerator compilers with PGI directive-based programming (OpenACC directives) and NVIDIA CUDA GPUs (CUDA SDK/APIs; CUDA tuning and performance; CUDA Toolkit) are used as the computing platform in the hypack-2013 workshop.

    The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators. For free trial software, users can try NVIDIA's "Accelerate Your Scientific Code with OpenACC" trial, the Portland Group PGI Accelerator Fortran and C99 trial software, or the CAPS Enterprise HMPP Workbench trial software. For more details, visit NVIDIA OpenACC.

    PGI :

    The Portland Group (PGI) ( http://www.pgroup.com ), the leading independent supplier of compilers and tools for high-performance computing, has made available the initial release of its PGI Accelerator Fortran and C compilers with support for the new OpenACC specification for directive-based programming of GPUs and accelerators. A trial version of the beta release of the PGI Accelerator compilers with support for the OpenACC standard is available for free from the PGI website at http://www.pgroup.com/support/downloads.php. The beta software includes a restricted-use license; the license agreement is available at http://www.pgroup.com/support/BTLA. More information on the PGI Accelerator compilers with OpenACC support is available at http://www.pgroup.com/accelerate. More information on the OpenACC API and standard can be found at http://www.openacc.org.

    Execution Model : The execution model targeted by OpenACC API-enabled compilers is host-directed execution with an attached accelerator device, such as a GPU. The bulk of a user application executes on the host. Compute-intensive regions are offloaded to the accelerator device under control of the host. The device executes parallel regions, which typically contain work-sharing loops, or kernels regions, which typically contain one or more loops that are executed as kernels. Even in accelerator-targeted regions, the host must orchestrate the execution by allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the parallel region, queuing the device code, waiting for completion, transferring results back to the host, and deallocating memory.

    Memory Model : The most significant difference between a host-only program and a host+accelerator program is that the memory on the accelerator may be completely separate from host memory. This is the case with most current GPUs, for example. In this case, the host may not be able to read or write device memory directly because it is not mapped into the host's virtual memory space. All data movement between host memory and device memory must be performed by the host through runtime library calls that explicitly move data between the separate memories, typically using direct memory access (DMA) transfers. Programmers must be aware of the available memory bandwidth, which affects compute intensity, as well as the limited device memory and the cache available for read-only data.



An Overview of OpenACC Directives

  • Directives facilitate code development for accelerators
  • Provide functionality to initiate accelerator startup and shutdown; manage data and program transfers between the host (CPU) and the accelerator; manage the distribution of work between the host (CPU) and the accelerator; and map the required computations onto accelerators

Some of the Categories of OpenACC APIs are :
  • Accelerator Parallel Region / Kernels Directives
  • Loop Directives
  • Data Declaration Directives
  • Data Region Directives
  • Cache Directives
  • Runtime Library Routines
  • Environment variables

C/C++ :
     #pragma acc directive-name [clause [, clause]...] new-line

Fortran :

     !$acc directive-name [clause [, clause]...]

     c$acc directive-name [clause [, clause]...]

     *$acc directive-name [clause [, clause]...]


OpenACC Parallel Directive

#pragma acc parallel [clause [, clause]...] new-line
            structured block


The kernels directive defines a region of a program that is to be compiled into a sequence of kernels for execution on the accelerator; typically, each loop nest becomes a different kernel, and the kernels are launched in order on the device. When a parallel directive is executed, gangs of worker threads are created to execute the region on the accelerator; one worker in each gang begins executing the code in the structured block, and the number of gangs and workers remains constant for the duration of the parallel region.
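A minimal OpenACC sketch in C combining data clauses with the parallel loop directive described above; the SAXPY routine and the compile command are illustrative assumptions:

     /* saxpy.c : compile e.g. with "pgcc -acc saxpy.c" (PGI Accelerator compiler) */
     void saxpy(int n, float a, const float *restrict x, float *restrict y)
     {
         /* copyin: x is only read on the device; copy: y is read and written back */
         #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
         for (int i = 0; i < n; ++i)
             y[i] = a * x[i] + y[i];
     }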


OpenCL - CUDA Enabled NVIDIA GPU

Architecture : The CUDA architecture is close to the OpenCL architecture. A CUDA device is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). A multiprocessor corresponds to an OpenCL compute unit. A multiprocessor executes a CUDA thread for each OpenCL work-item and a thread block for each OpenCL work-group. A kernel is executed over an OpenCL NDRange by a grid of thread blocks. Each of the thread blocks that execute the kernel is therefore uniquely identified by its work-group ID, and each thread by its global ID or by a combination of its local ID and work-group ID. A thread is also given a unique thread ID within its block. When an OpenCL program on the host invokes a kernel, the work-groups are enumerated and distributed as thread blocks to the multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

Memory Model : Each multiprocessor of the NVIDIA CUDA architecture has on-chip memory of the following four types:

  • One set of local 32-bit registers per processor,

  • A parallel data cache, or shared memory, that is shared by all scalar processor cores and is where OpenCL local memory resides,

  • A read-only constant cache that is shared by all the scalar processor cores and speeds up reads from OpenCL constant memory,

  • A read-only texture cache that is shared by all scalar processor cores and speeds up reads from OpenCL image objects; each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering specified by OpenCL sampler objects. The region of device memory addressed by images is referred to as texture memory.

There is also a global memory address space that is used for OpenCL global memory and a local memory address space that is private to each thread (and should not be confused with OpenCL local memory). Both memory spaces are read-write regions of device memory and are not cached.
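The terminology mapping above can be made concrete with a small sketch: the same global-index computation written as a CUDA kernel, with the corresponding OpenCL constructs noted in comments (the kernel name is an illustrative assumption).

     /* CUDA kernel; the OpenCL equivalents are noted in the comments */
     __global__ void add_one(float *buf, int n)
     {
         /* CUDA:   blockIdx.x * blockDim.x + threadIdx.x
            OpenCL: get_global_id(0)                            */
         int gid = blockIdx.x * blockDim.x + threadIdx.x;

         /* CUDA threadIdx.x  <->  OpenCL get_local_id(0)
            CUDA blockIdx.x   <->  OpenCL get_group_id(0)
            CUDA __shared__   <->  OpenCL __local memory         */
         if (gid < n)
             buf[gid] += 1.0f;
     }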



List of Programs - OpenCL - CUDA enabled NVIDIA GPUs :

  • CUDA Kernels, Thread Hierarchy, Device Memory, Advantage of Shared Memory, page-locked & pinned memory,Asynchronous Concurrent Execution,Overlap of Data Transfer and Kernel Execution, Dynamic partitioning of Shared memory resources, Kernel function -Divergence; Global Memory Bandwidth (Memory access pattern for coalescing).

  • CUDA Streams, Multi-Device System, Warp level Parallelism - CUDA, Data Prefetching -

  • Basic Codes : Numerical integration of the function f(x) = 4/(1+x^2) between the limits 0 and 1; prefix sum of a given array

  • Introduction to NVIDIA-PGI Compiler Directives - OpenACC on GPUs; CUDA enabled NVIDIA GPUs

  • Performance of Matrix Computations - NVIDIA-PGI Compiler Directives OpenACC on GPUs; CUDA enabled NVIDIA GPUs

  • Performance of Application Kernels - NVIDIA-PGI Compiler Directives OpenACC on GPUs; CUDA enabled NVIDIA GPUs

  • Example programs based on The OpenACC Application Program Interface (a collection of compiler directives and the details are implicit in the programming model and are managed by the OpenACC API-enabled compilers and runtimes) for matrix computations on NVIDIA GPUs.

  • Simple example programs on Multi-Core Processors with NVIDIA - GPU Computing CUDA 4.1 SDK.

  • Special example programs using CUDA Tool Chain on Multi-Core Processors with NVIDIA - GPU Computing CUDA SDK (CULA Tools, CUBLAS, CUFFT, CUSPARSE)

  • Special example programs on matrix computations using Concurrent Asynchronous Execution APIs of CUDA 4.1 enabled NVIDIA GPUs (single/Multiple devices).

  • Special example programs based on Streams (Concurrent Asynchronous Execution) of CUDA 4.1 of NVIDIA GPU

  • LLVM-based CUDA compiler and toolkit technologies for matrix computation and application kernels; GPU Accelerator Programming Model - Compiler Optimizations

  • Exposure to the NVIDIA Parallel Nsight toolkit.

  • Codes to understand different memory types of CUDA enabled NVIDIA GPUs for matrix computations.

  • Example programs based on Numerical Linear Algebra using CUDA enabled NVIDIA GPUS and OpenCL.

  • Example programs (BLAS, FFTs) based on CUDA BLAS Libraries

  • Example programs based on special classes of problems - Dense & Sparse Matrix Computations, Fast Search Algorithms, & Partial Differential Eqs. (PDEs) - discussed using CUDA enabled NVIDIA GPUs

  • Code Walk through and execution of parallel programs based on a mixed programming environment using TBB, Pthreads, OpenMP on host Multi-Core systems with GPU Accelerator devices.

  • Selective example programs on numerical and non-numerical computations using NVIDIA - GPU Computing CUDA SDK and OpenCL.

  • Example programs based on CUDA APIs to completely overlap CPU and GPU execution and I/O in HPC GPU Cluster environment.

  • Performance of memory (pinned/locked) & CUDA shared memory usage on CUDA enabled GPUs for application kernels.

  • Develop test suites to launch multiple kernels on CUDA enabled NVIDIA single & multiple GPU devices.

  • Tuning & Performance using CUDA enabled NVIDIA GPU Libraries; Memory Optimisation, Data-access optimization for matrix computations

  • Demonstration of Integrated Numerical Linear Algebra Kernels for Matrix Computations (using Open Source Software) on CUDA enabled NVIDIA GPUs & OpenCL.

  • Example programs on Heterogeneous Programming - OpenCL based on CUDA enabled NVIDIA GPUs.


  • Implementation of Image Processing applications (Edge Detection, Face Detection & Image inpainting algorithms) on GPGPUs using CUDA/OpenCL enabled NVIDIA GPUs and OpenCL of HPC GPU Cluster

  • Implementation of String Search Algorithms - CUDA/OpenCL enabled NVIDIA GPUs and OpenCL of HPC GPU Cluster

  • Tiled matrix-matrix multiplication, Numerical Linear Algebra - CUDA; CUDA BLAS Libraries, CUDA SDKs, Implementation of Partial Differential Equations, Image Processing - Edge Detection Algorithms; String Search Algorithms

  • Example programs that take advantage of shared memory features of CUDA enabled NVIDIA GPUs for Dense Matrix computations

  • Example programs that take advantage of CUDA Streams for Multi-GPU implementation of Dense matrix computation Kernels

The matrix multiplication examples illustrate the typical data-parallel approach used by OpenCL applications to achieve good performance on GPUs. They illustrate the use of OpenCL local memory, which maps to shared memory on the CUDA architecture. Shared memory is much faster than global memory, and implementations based on shared-memory accesses give a performance improvement for typical matrix computations.
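A minimal CUDA sketch of the tiled (shared-memory) matrix multiplication idea described above, for square matrices whose dimension is a multiple of the tile width; the tile width and kernel name are illustrative assumptions:

     #define TILE 16

     /* C = A * B for n x n matrices; n is assumed to be a multiple of TILE */
     __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
     {
         __shared__ float As[TILE][TILE];    /* tiles staged in fast shared memory  */
         __shared__ float Bs[TILE][TILE];

         int row = blockIdx.y * TILE + threadIdx.y;
         int col = blockIdx.x * TILE + threadIdx.x;
         float acc = 0.0f;

         for (int t = 0; t < n / TILE; ++t) {
             As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
             Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
             __syncthreads();                /* tile fully loaded before use        */

             for (int k = 0; k < TILE; ++k)
                 acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
             __syncthreads();                /* finish using the tile before reload */
         }
         C[row * n + col] = acc;
     }

The kernel would be launched with a (TILE, TILE) thread block and an (n/TILE, n/TILE) grid, so that each block computes one output tile from shared-memory staging rather than repeated global-memory reads.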

Experts may discuss performance guidelines, focusing on Instruction Performance, Memory Bandwidth Issues, Shared Memory, NDRange & execution time of a kernel launch on the OpenCL implementation, Data transfer between Host and Device, Warp level synchronization issues, and overall performance optimization strategies.



References

1. NVIDIA Kepler Architecture
2. NVIDIA CUDA toolkit 5.0 Preview Release April 2012
3. NVIDIA Developer Zone
4. RDMA for NVIDIA GPUDirect coming in CUDA 5.0 Preview Release, April 2012
5. NVIDIA CUDA C Programming Guide Version 4.2, dated 4/16/2012 (April 2012)
6. Dynamic Parallelism in CUDA Tesla K20 Kepler GPUs - Pre-release of NVIDIA CUDA 5.0
7. NVIDIA Developer ZONE - CUDA Downloads CUDA TOOLKIT 4.2
8. NVIDIA Developer ZONE - GPUDirect
9. Openacc - NVIDIA
10. Nsight, Eclipse Edition Pre-release of CUDA 5.0, April 2012
11. NVIDIA OpenCL Programming Guide for the CUDA Architecture version 4.0 Feb, 2011 (2/14,2011)
12. Optimization : NVIDIA OpenCL Best Practices Guide Version 1.0, Feb 2011
13. NVIDIA OpenCL JumpStart Guide - Technical Brief
14. NVIDIA CUDA C BEST PRACTICES GUIDE (Design Guide) V4.0, May 2011
15. NVIDIA CUDA C Programming Guide Version V4.0, May 2011 (5/6/2011)
16. NVIDIA GPU Computing SDK
17. Apple : Snowleopard - OpenCL
18. The OpenCL Specification, Version 1.1, Published by Khronos OpenCL Working Group, Aaftab Munshi (ed.), 2010.
19. The OpenCL Specification, Version 1.0, Khronos OpenCL Working Group
20. Khronos V1.0 Introduction and Overview, June 2010
21. The OpenCL 1.1 Quick Reference card.
22. OpenCL 1.1 Specification (Revision 44) June 1, 2011
23. The OpenCL 1.1 Specification (Document Revision 44) Last Revision Date : 6/1/11 Editor : Aaftab Munshi Khronos OpenCL Working Group
24. OpenCL Reference Pages
25. MATLAB
26. NVIDIA - CUDA MATLAB Acceleration
27. Jason Sanders, Edward Kandrot (Foreword by Jack Dongarra), CUDA by Example - An Introduction to General-Purpose GPU Programming, Addison-Wesley, 2011, NVIDIA
28. David B. Kirk, Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, NVIDIA Corporation, 2010; Elsevier / Morgan Kaufmann Publishers, 2011
29. OpenCL Toolbox for MATLAB
30. NAG