




hyPACK-2013 Mode-2 : GPU Computing ; CUDA enabled NVIDIA GPU

NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on the company's powerful GPUs. The CUDA programming model allows these GPUs to be harnessed effectively for tasks other than graphics, achieving teraflops of computing power. For high performance computing, the programming model generalizes the shaders of graphics computing; in this context shader programming is commonly referred to as stream processing or thread processing.

  • GPGPU Computing
  • NVIDIA GPU - CUDA
  • CUDA Prog.
  • CUDA API
  • CUDA Compilation & Execution
  • CUDA SDK
  • NVIDIA GPU - OpenCL
  • NVIDIA CUDA Tool Kit 4.0 for Applications :
    - CUDA Multi-GPU Prog.
    - Unified Virtual Addressing & GPUDirect 2.0
    - CUDA Driver API
    - CUDA Toolkit Libraries
  • Codes : NVIDIA - PGI OpenACC Compilation & Execution
  • References & Web-Pages : GPGPU & GPU Computing
  • Web-Sites : NVIDIA CUDA


List of Programs CUDA enabled NVIDIA GPUs

Module 1 : Getting Started : CUDA enabled NVIDIA GPU Programs
Module 2 : Getting Started : PGI OpenACC APIs on CUDA enabled NVIDIA GPU
Module 3 : CUDA enabled NVIDIA GPU Programs on Numerical Computations (Dense Matrix Computations)
Module 4 : CUDA enabled NVIDIA GPU Programs using BLAS libraries for Matrix Computations
Module 5 : CUDA enabled NVIDIA GPU Programs - Application Kernels
Module 6 : CUDA enabled NVIDIA GPU Memory Optimization Programs - Tuning & Performance
Module 7 : CUDA enabled NVIDIA GPU Streams : Concurrent Asynchronous Execution

The CUDA programming model manages threads automatically, and CUDA code differs significantly from single-threaded CPU code and, to some extent, even from conventional parallel code. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks.

GPGPU Computing


In the recent past, the computational power of GPUs has widely attracted the scientific community, as GPUs provide unprecedented computational power for data-intensive applications. A Graphics Processing Unit (GPU) is specifically designed to be extremely fast at processing large graphics data sets (e.g., polygons and pixels) for rendering tasks. Today's best GPUs can execute many billions of floating point operations per second, 10 to 20 times the performance of the fastest dual-core processors. Today's GPUs and game-oriented CPUs are far more highly optimized for single-precision (32-bit) floating point operations than for double-precision (64-bit) operations. For a long time the main user community was confined to video games. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

The GPU is viewed as a compute device capable of executing a very high number of threads in parallel. It operates as a co-processor to the main CPU, called the host. Data-parallel, compute-intensive portions of applications running on the host are transferred to the device as functions that are executed on the device by many different threads. Both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively. The programmable GPU has evolved into a highly parallel, multi-threaded, many-core device, driven by the demand for real-time, high-definition 3D graphics. The gap in floating-point capability between the CPU and the GPU is changing fast because the GPU is specialized for compute-intensive, highly parallel computation. The GPU is especially well-suited to problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because of this, memory access latency can be hidden with calculations instead of big data caches.



GPGPU : In recent years, general-purpose GPU (GPGPU) processing has gained much attention. The phrase "general purpose" in the context of High Performance Computing (HPC) usually means data-intensive applications in scientific and engineering fields. In GPGPU processing, it is the graphics performance of specialized software - e.g., scientific software, image manipulation, video decoders/encoders, games - that makes GPU performance so important. Using GPGPU programming techniques, programmers can use the GPU's pixel shaders as general-purpose single-precision FPUs. For typical video applications, GPGPU processing is highly parallel, but it relies on the size of off-chip video memory to operate on large data sets. Off-chip memory on GPGPUs plays an important role in applications in which different threads must interact with each other through off-chip memory. From the graphics point of view, the video memory, normally used for texture maps and so forth in graphics applications, may store any kind of data in GPGPU applications.

Also, the graphics (GPGPU) bandwidth - i.e., the bandwidth of the memory on the graphics processors (GPGPUs) and the bandwidth of the bus that connects them to the host computer - plays an important role in speeding up computations. The speed at which data can be sent to the GPGPUs, processed internally and returned as results is as important as the processing power of the GPGPUs. Also relevant is the performance of video (GFX) rendering, i.e., how efficiently graphics processors handle rendering; such operations are used by all graphics software, image manipulation, video decoders/encoders, games and modern operating systems. Video (GFX) memory is crucial for performance: the bandwidth of the memory on the video adapters (GFXs) and the bandwidth of the bus that connects them to the computer drive the performance.

NVIDIA GPU CUDA


The NVIDIA CUDA technology is a fundamentally new computing architecture that enables the GPU to solve complex computational problems. CUDA technology gives computationally intensive applications access to the processing power of NVIDIA graphics processing units (GPUs) through a new programming interface. Software development is greatly simplified by using the standard C language. NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on NVIDIA's powerful GPUs. The game community has been using NVIDIA's GPUs and graphics cards for a long time, and at present the graphics market is changing very fast. NVIDIA's GeForce, Quadro and Tesla brand products are steadily winning customers in scientific and engineering fields. GeForce and Quadro brand products serve the traditional consumer graphics market, while the Tesla and Fermi products are intended for high-performance computing.

CUDA : NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on NVIDIA's powerful GPUs. The CUDA programming model allows GPUs to be harnessed effectively for tasks other than graphics, achieving teraflops of computing power. For high performance computing, the programming model generalizes the shaders of graphics computing; in this context shader programming is referred to as "stream processing" or "thread processing". Each thread processor in an NVIDIA GeForce 8-Series GPU can manage 96 concurrent threads, and these processors have their own FPUs, registers and shared local memory. CUDA requires programmers to write special code for parallel processing, but it doesn't require them to explicitly manage threads, which simplifies the programming model. CUDA includes C/C++ software development tools, function libraries and a hardware abstraction mechanism that hides the GPU hardware from developers. CUDA provides a solution for such applications, and NVIDIA's newer GPUs, which support double-precision floating point operations, can address a broader class of applications.

NVIDIA simplifies the programming model by removing the burden of managing threads. This is an important feature of CUDA: application programmers do not write explicitly threaded code; a hardware thread manager handles the threading automatically. Automatic thread management is vital when multi-threading scales to thousands of threads. NVIDIA's cards can manage a very large number of concurrent threads (more than 10,000 for GeForce 8 GPUs). Although these are lightweight threads in the sense that each operates on a small piece of data, they are fully fledged threads in the conventional sense: each thread has its own stack, register file, program counter and local memory. The GPU handles the state of active and inactive threads, and the complete run-time thread management is transparent to the programmer. This also helps the programmer eliminate potential bugs in the application.

The G80 chip on an NVIDIA 8800 Ultra graphics card has 16 multiprocessors with 8 processors each, for a total of 128 processors. These are generalized floating-point processors capable of operating on 8-, 16- and 32-bit integer types, and 16- and 32-bit floating point types. Each multiprocessor has a 16 KB memory that is shared by the processors within the multiprocessor. Access to a location in this shared memory has a latency of only 2 clock cycles, allowing fast non-local operations. The processors are clocked (shader clock) at 1.6 GHz, giving the GeForce 8800 Ultra a tremendous amount of floating-point processing power. Each multiprocessor has a Single Instruction, Multiple Data (SIMD) architecture.
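These characteristics can be queried at run time; the program below is a minimal sketch using the CUDA Runtime API, assuming device 0 is the card of interest.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("Device                  : %s\n",  prop.name);
    printf("Multiprocessors         : %d\n",  prop.multiProcessorCount);
    printf("Shared memory per block : %d KB\n", (int)(prop.sharedMemPerBlock / 1024));
    printf("Warp size               : %d threads\n", prop.warpSize);
    return 0;
}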

CUDA Prog.

CUDA is a parallel programming model and software environment designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions - a hierarchy of thread groups, shared memories, and barrier synchronization - that are simply exposed to the programmer as a minimal set of extensions to C. These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel. Such a decomposition preserves language expressibility by allowing threads to cooperate when solving each sub-problem, and at the same time enables transparent scalability since each sub-problem can be scheduled to be solved on any of the available processor cores: A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count.

The CUDA programming model manages threads automatically, and CUDA code differs significantly from single-threaded CPU code and, to some extent, even from conventional parallel code. Before the availability of NVIDIA's CUDA, some users in the parallel processing community already wrote general-purpose codes for GPUs. The CUDA model is highly parallel, like the GPGPU model. The approach is to divide the data set into smaller chunks stored in on-chip memory and then allow multiple thread processors to share each chunk. Storing the data locally reduces the need to access off-chip memory, thereby improving performance.

Designing classes of applications in scientific computing that avoid access to off-chip memory requires rewriting the application or redesigning the algorithm. Also, the overheads involved in loading the required off-chip data into local memory may affect performance. CUDA handles this intelligently: an off-chip memory access usually doesn't stall a thread processor, because another thread is ready to execute. In CUDA, a group of threads works together in round-robin fashion, ensuring that each thread gets execution time without delaying other threads, thereby reducing thread overheads. The wait for remote access and service strongly factors into a CUDA program's efficiency and scaling. CUDA avoids deadlocks, irrespective of the number of threads running, and it provides special APIs for barrier synchronization. A thread block is a batch of threads that can cooperate by efficiently sharing data through fast shared memory and by synchronizing their execution to coordinate memory accesses at synchronization points specified in the kernel. Each thread is identified by its thread ID, which is the thread number within the block. An application can also specify a block as a three-dimensional array and identify each thread using a 3-component index.
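As a small, hypothetical illustration of a 3-component thread index, the kernel below derives each thread's number within a 4 x 4 x 4 block from threadIdx; the names fillWithTid and d_data are illustrative.

#include <cuda_runtime.h>

__global__ void fillWithTid(float *data)
{
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;   // thread number within the block
    data[tid] = (float)tid;
}

int main(void)
{
    float *d_data;
    cudaMalloc((void **)&d_data, 64 * sizeof(float));
    dim3 block(4, 4, 4);                  // one block of 4 x 4 x 4 = 64 threads
    fillWithTid<<<1, block>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}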

CUDA - Data Decomposition : Analyze the Data : It is the responsibility of developers to analyze their problem to determine how best to divide the data into smaller chunks for distribution among the thread processors. This decomposition does require programmers to rethink array sizes and their access patterns. If the contents of an array are too large for the CPU's caches, the program must frequently access off-chip memory, severely impeding performance.

In CUDA, a block of these threads can run on a cluster of thread processors and operate on the same data (the packet) in shared memory. If much or all of the data fits in local memory, the program needn't access off-chip memory. If a thread does need to access off-chip memory, the stalled thread enters the inactive queue and yields to another thread. This process continues until all threads in the block have scanned the packet. Meanwhile, other blocks of threads are simultaneously doing the same thing while running on other clusters of thread processors.

To get performance from applications, one of the challenges is to analyze the algorithm and find the optimal numbers of threads and blocks that will keep the GPU fully utilized. The relevant factors include the size of the global data set, the maximum amount of local data that a block of threads can share, the number of thread processors in the GPU, and the sizes of the on-chip local memories. The limit on the number of concurrent threads and the number of registers available to each thread play a key role for performance. The CUDA compiler automatically determines the optimal number of registers for each thread, so the practical degree of concurrency depends on how the compiler technology assigns registers per thread. CUDA development tools may address the difficulties of choosing the data layout and the algorithmic overheads.

NVIDIA's CUDA GPU Computing provides a direct, general-purpose C language interface to the programmable processors on NVIDIA's 8-series GPUs, eliminating much of the complexity of writing GPGPU applications using graphics APIs such as OpenGL. Furthermore, CUDA exposes some important new hardware features that have large benefits to the performance of data-parallel computations:

  • General Load-Store Architecture : CUDA allows arbitrary gather and scatter memory access from GPU programs.
  • On-chip Shared Memory : Each multiprocessor on the GPU contains a fast on-chip memory (16KB on NVIDIA 8-series GPUs). All threads running on a multiprocessor can load and store data from the memory.
  • Thread Synchronization : A barrier instruction is provided to synchronize between all threads active on a GPU multiprocessor. Together with shared memory, this feature allows threads to cooperatively compute results.

NVIDIA's 8-series GPUs feature multiple physical multi-processors, each with a shared memory and multiple scalar processors (for example, the NVIDIA GeForce 8800 GTX has 16 multiprocessors with eight processors each). CUDA structures GPU programs into parallel thread blocks of up to 512 SIMD-parallel threads. Programmers specify the number of thread blocks and threads per block, and the hardware and drivers map thread blocks to parallel multiprocessors on the GPU. Within a thread block, threads can communicate through shared memory and cooperate by combining shared memory with thread synchronization. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks.
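A hypothetical sketch of such block-level cooperation is given below: each thread block sums its portion of the input in on-chip shared memory and writes one partial result. blockSum and BLOCK_SIZE are illustrative names, and the block size is assumed to be a power of two.

#define BLOCK_SIZE 256

__global__ void blockSum(const float *in, float *out)
{
    __shared__ float cache[BLOCK_SIZE];              // fast on-chip shared memory

    int tid = threadIdx.x;
    cache[tid] = in[blockIdx.x * blockDim.x + tid];  // each thread loads one element
    __syncthreads();                                 // barrier: all loads complete

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];       // pairwise sums in shared memory
        __syncthreads();                             // barrier after every step
    }
    if (tid == 0)
        out[blockIdx.x] = cache[0];                  // one partial sum per block
}

// Launch (sketch): blockSum<<<numBlocks, BLOCK_SIZE>>>(d_in, d_partial);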

CUDA Kernel

CUDA extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads for each call is specified using a new <<<...>>> syntax:

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;      // thread ID within the block
    C[i] = A[i] + B[i];       // element-wise vector addition
}

int main()
{
    // Kernel invocation with one block of N threads
    vecAdd<<<1, N>>>(A, B, C);
}

Figure 1. Kernel Definition.

Each of the threads that execute a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable. As an illustration, the above sample code adds two vectors A and B of size N and stores the result in vector C.
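A complete host-side version of this example might look like the following minimal sketch (illustrative names, no error checking): device memory is allocated, A and B are copied to the device, the kernel is launched with one block of N threads, and C is copied back.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 256

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(void)
{
    size_t bytes = N * sizeof(float);
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);                       // device memory allocation
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);    // host -> device
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(dA, dB, dC);                         // one block of N threads

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);    // device -> host
    printf("C[10] = %f\n", hC[10]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}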

CUDA Memory Layout


CUDA threads may access data from multiple memory spaces during their execution. Each thread has a private local memory. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block. Finally, all threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages. Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats. The global, constant, and texture memory spaces are persistent across kernel launches by the same application.

cuda-memory-layout
Figure 2. CUDA memory layout Model
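The memory spaces shown in Figure 2 can be used from a kernel as in the sketch below (illustrative names): a __device__ global array, a __constant__ array, a per-block __shared__ array and a thread-private local variable. The kernel is assumed to be launched with at most 128 threads per block.

__device__   float d_table[256];       // global memory, lifetime of the application
__constant__ float d_coeff[16];        // constant memory, read-only in kernels

__global__ void combine(float *out)    // launch with at most 128 threads per block
{
    __shared__ float tile[128];        // shared memory, one copy per thread block
    float scale = 0.5f;                // thread-private local variable

    int tid = threadIdx.x;
    tile[tid] = d_table[tid % 256] * d_coeff[tid % 16];
    __syncthreads();                   // make the shared tile visible to the whole block
    out[blockIdx.x * blockDim.x + tid] = scale * tile[tid];
}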


Host and Device

CUDA assumes that the CUDA threads may execute on a physically separate device that operates as a co-processor to the host running the C program. This is the case, for example, when the kernels execute on a GPU and the rest of the C program executes on a CPU. CUDA also assumes that both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime (described in Chapter 4). This includes device memory allocation and deallocation, as well as data transfer between host and device memory.



heterogeneous-programming
Figure 3. CUDA : Heterogeneous Programming


CUDA Software Stack

The CUDA software stack is composed of several layers, as illustrated in the following figure: a device driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries of common usage, CUFFT and CUBLAS, both described in separate documents.

software stack
Figure 4. CUDA : Software Stack

CUDA : Application Programming Interface (API)

The goal of the CUDA programming interface is to provide a relatively simple path for users familiar with the C programming language to easily write programs for execution by the device. It consists of:

  • A minimal set of extensions to the C language that allow the programmer to target portions of the source code for execution on the device;
  • A runtime library split into :
    - A host component that runs on the host and provides functions to control and access one or more compute devices from the host;
    - A device component that runs on the device and provides device-specific functions;
    - A common component that provides built-in vector types and a subset of the C standard library that are supported in both host and device code.

Function Type Qualifier:

1. __device__ :

The __device__ qualifier declares a function that is:
- Executed on the device.
- Callable from the device only.

2. __global__ :

The __global__ qualifier declares a function that is:
- Executed on the device.
- Callable from the host only.

3. __host__ :

The __host__ qualifier declares a function that is:
- Executed on the host.
- Callable from the host only.


It is equivalent to declare a function with only the __host__ qualifier or to declare it without any of the __host__, __device__, or __global__ qualifiers; in either case the function is compiled for the host only. However, the __host__ qualifier can also be used in combination with the __device__ qualifier, in which case the function is compiled for both the host and the device.
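A small sketch of this combination is shown below; square and squareAll are illustrative names, and the same square() function is compiled for, and callable from, both the host and the device.

__host__ __device__ float square(float x)
{
    return x * x;                     // compiled for both host and device
}

__global__ void squareAll(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);    // device-side call
}

// On the host:  float y = square(3.0f);   // host-side call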

Variable Type Qualifier:

1. __device__ :

The __device__ qualifier declares a variable that resides on the device.

At most one of the other type qualifiers defined in the following sections may be used together with __device__ to further specify which memory space the variable belongs to. If none of them is present, the variable:
- Resides in global memory space.
- Has the lifetime of an application.
- Is accessible from all the threads within the grid and from the host through the runtime library.

2. __constant__ :

The __constant__ qualifier, optionally used together with __device__, declares a variable that:
- Resides in constant memory space.
- Has the lifetime of an application.
- Is accessible from all the threads within the grid and from the host through the runtime library.

3. __shared__ :

The __shared__ qualifier, optionally used together with __device__, declares a variable that:
- Resides in the shared memory space of a thread block.
- Has the lifetime of the block.
- Is only accessible from all the threads within the block.

Built-in Variables

1. gridDim : This variable is of type dim3 and contains the dimensions of the grid.
2. blockIdx : This variable is of type uint3 and contains the block index within the grid.
3. blockDim : This variable is of type dim3 and contains the dimensions of the block.
4. threadIdx : This variable is of type uint3 and contains the thread index within the block.
5. warpSize : This variable is of type int and contains the warp size in threads.
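A short kernel sketch using these built-in variables is given below (scale is an illustrative name): a grid-stride loop computes a global index from blockIdx, blockDim and threadIdx and then steps by the total number of threads in the grid.

__global__ void scale(float *data, float alpha, int n)
{
    int start  = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int stride = gridDim.x * blockDim.x;                  // total threads in the grid
    for (int i = start; i < n; i += stride)
        data[i] = alpha * data[i];
}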

  • CUDA API used:

    cudaMalloc(void** array, size_t size)
    // allocates memory on device

    cudaFree(void* array )       
    // frees memory allocated on device

    cudaMemcpy((void*)device_array, (void*)host_array, size , cudaMemcpyHostToDevice )    
    // copies from host to device

    cudaMemcpy((void*)host_array, (void*)device_array, size , cudaMemcpyDeviceToHost )
    // copies from device to host

  • CUDPP API used:

    cudppSparseMatrix(&sparseMatrixHandle, config, no_of_non_zero, no_of_rows, (void *) matrix, (unsigned int *) row_ptr, (unsigned int *)col_idx);
        // this function creates a sparse matrix object assigned to sparseMatrixHandle.

    cudppSparseMatrixVectorMultiply(sparseMatrixHandle, result, vector);    
    // performs the multiplication

  • CUDA Compilation, Linking and Execution of Program

    Compiling a CUDA program involves additional steps, partly because the program targets two different processor architectures (the GPU and a host CPU), and partly because of CUDA's hardware abstraction. Compiling a CUDA program is not as straightforward as running a C compiler to convert source code into executable object code. The same source file mixes C/C++ code written for both the GPU and the CPU, and special extensions and declarations identify the GPU code. The first step is to separate the source code for each target architecture.

    nvcc is a compiler driver that simplifies the process of compiling CUDA code: it provides simple and familiar command line options and executes them by invoking the collection of tools that implement the different compilation stages. nvcc's basic workflow consists of separating device code from host code and compiling the device code into a binary form or cubin object. The generated host code is output either as C code that is left to be compiled using another tool, or as object code directly, by invoking the host compiler during the last compilation stage.


    software stack

    Figure 5. CUDA : Source Code Compilation Stages.


    CUDA code should include the cuda.h header file. On the compilation command line, the cuda library should be specified to the linker on UNIX and Linux environments. Two steps are explained below.

    1. Using command line arguments to compile CUDA source code:

    The compilation and execution of a CUDA program is similar to that of C language source code.

    $ nvcc -o < executable name > < name of source file >

    For example to compile a simple Hello World program user can give :

    $ nvcc -o helloworld cuda-helloworld.cu

    Executing a Program:

    To execute a CUDA Program, give the name of the executable at command prompt.

    $ ./<Name of the Executable>

    For example, to execute a simple HelloWorld Program, user must type:

    $ ./helloworld

    The output must look similar to the following:

    Hello World!

    CUDA Developer SDK

    Visit http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html to download CUDA enabled NVIDIA programs.

    Sample programs can be downloaded from http://www.nvidia.com/object/cuda_get_samples.html


    CUDA Tool Kit 4.0 for Applications

    CUDA Multi-GPU Programming : The CUDA programming model provides two basic approaches to execute CUDA kernels on multiple GPUs (CUDA "devices") concurrently from a single host application:

    • Use one host thread per device, since any given host thread can call cudaSetDevice() at most one time.
    • Use the push/pop context functions provided by the CUDA Driver API.

    For applications that require tight coupling of the various CUDA devices within a system, these approaches may not be sufficient because the devices need to synchronize or communicate with each other. The CUDA Runtime now provides features by which a single host thread can easily launch work onto any device it needs. To accomplish this, a host thread can call cudaSetDevice() at any time to change the currently active device; a host thread can now control more than one device. The CUDA Driver API (Version 4.0) provides a way to access multiple devices from within a single host thread, namely cuCtxPushCurrent() and cuCtxPopCurrent(). For convenience, CUDA application developers can use a set/get context management interface, and CUDA 4.0 provides additional features: cuCtxSetCurrent() and cuCtxGetCurrent() have been added to version 4.0 of the CUDA Driver API in addition to the existing cuCtxPushCurrent() and cuCtxPopCurrent() functions.
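    A minimal sketch of this relaxed model is given below (illustrative, with no error checking): a single host thread switches the active device with cudaSetDevice() and issues independent work to each GPU in turn.

    #include <cuda_runtime.h>

    int main(void)
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        for (int dev = 0; dev < deviceCount; ++dev) {
            cudaSetDevice(dev);                      // change the currently active device
            float *d_buf;
            cudaMalloc((void **)&d_buf, 1 << 20);    // allocation belongs to device 'dev'
            cudaMemset(d_buf, 0, 1 << 20);           // work issued to device 'dev'
            cudaFree(d_buf);
        }
        return 0;
    }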

    Programming a multi-GPU application is no more difficult than programming an application to utilize multiple cores or sockets, because CUDA is completely orthogonal to CPU thread management or message passing APIs. Most importantly, the correct GPU must be selected, which in most cases is a free GPU (one without a context). Also, the compute-intensive portions of existing multi-threaded CPU code can be identified and ported to the GPU while leaving the inter-CPU-thread communication code unchanged.

    In order to issue work to a GPU, a context is established between a CPU thread (or group of threads) and the GPU. Only one context can be active on a GPU at any particular instant. Similarly, a CPU thread can have one active context at a time. A context is established during the program's first call to a function that changes state (such as cudaMalloc(), etc.), so one can force the creation of a context by calling cudaFree(0). Note that a context is created on GPU 0 by default, unless another GPU is selected explicitly prior to context creation with a cudaSetDevice() call. The context is destroyed either with a cudaDeviceReset() call or when the controlling CPU process exits.
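    The following small sketch illustrates this behaviour, with GPU 1 chosen purely for illustration: the device is selected before the first state-changing call, cudaFree(0) forces context creation, and cudaDeviceReset() destroys the context.

    #include <cuda_runtime.h>

    int main(void)
    {
        cudaSetDevice(1);     // select GPU 1 explicitly, before any context is created
        cudaFree(0);          // first state-changing call: forces context creation on GPU 1

        /* ... allocations and kernel launches on GPU 1 ... */

        cudaDeviceReset();    // explicitly destroys the context on GPU 1
        return 0;
    }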

    MPI, OpenMP, Pthreads on Host CPU (Multi-Core) & Multi-GPU : In order to issue work to p GPUs concurrently, a program can either use p CPU threads, each with its own context, or it can use one CPU thread that swaps among several contexts, or some combination thereof. CPU threads can be lightweight (pthreads, OpenMP, etc.) or heavyweight (MPI). Note that any CPU multi-threading or message-passing API or library can be used, as CPU thread management is completely orthogonal to CUDA. For example, one can add GPU processing to an existing MPI application by porting the compute-intensive portions of the code without changing the communication structure. For synchronization across computations on different GPUs, communication through the host CPU or GPUDirect is required.

    Even though a GPU can execute calls from one context at a time, it can belong to multiple contexts. For example, it is possible for several CPU threads to establish separate contexts with the same GPU (though multiple CPU threads within the same process accessing the same GPU would normally share the same context by default). The GPU driver manages GPU switching between the contexts, as well as partitioning memory among the contexts (GPU memory allocated in one context cannot be accessed from another context).

    In many applications, the algorithm is designed in such a way that each CPU thread (Pthreads, OpenMP, MPI) controls a different GPU. Achieving this is straightforward if a program spawns as many lightweight threads as there are GPUs - one can derive the GPU index from the thread ID. For example, the OpenMP thread ID can be readily used to select GPUs. The MPI rank can be used to choose a GPU reliably as long as all MPI processes are launched on a single host node having GPU devices and a host configuration of the CUDA programming environment.
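    A hypothetical sketch of the OpenMP-thread-per-GPU pattern is shown below; omp_get_thread_num() supplies the GPU index, and the code assumes a host compiler with OpenMP support (e.g. nvcc -Xcompiler -fopenmp).

    #include <omp.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        #pragma omp parallel num_threads(deviceCount)
        {
            int dev = omp_get_thread_num();   // one lightweight CPU thread per GPU
            cudaSetDevice(dev);               // derive the GPU index from the thread ID
            /* ... allocate device memory and launch kernels on device 'dev' ... */
        }
        return 0;
    }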

    Unified Virtual Addressing and GPUDirect 2.0 :


    CUDA Toolkit 4.0 eases programming in multi-GPU environments for NVIDIA Tesla T20-series (Fermi) GPUs running in 64-bit mode on Linux. Unified Virtual Addressing (UVA) allows the system memory and the one or more device memories in a system to share a single virtual address space. This allows the CUDA Driver to determine the physical memory space to which a particular pointer refers by inspection, which simplifies the APIs of functions such as cudaMemcpy(), since the application no longer needs to keep track of which pointers refer to which memory.

    Built on top of UVA, GPUDirect v2.0 provides for direct peer-to-peer communication among the multiple devices in a system and for native MPI transfers directly from device memory.
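    A minimal sketch of such a peer-to-peer copy is given below (two Fermi-class GPUs assumed, no error checking): once peer access is enabled, data can be copied directly between the two device memories, and under UVA the driver infers the direction from the pointers.

    #include <cuda_runtime.h>

    int main(void)
    {
        size_t bytes = 1 << 20;
        float *d0, *d1;

        cudaSetDevice(0);
        cudaMalloc((void **)&d0, bytes);             // buffer on device 0

        cudaSetDevice(1);
        cudaMalloc((void **)&d1, bytes);             // buffer on device 1

        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 1, 0);   // can device 1 address device 0 ?
        if (canAccess)
            cudaDeviceEnablePeerAccess(0, 0);        // current device (1) gains access to device 0

        // With UVA the driver infers the copy direction from the pointers:
        cudaMemcpy(d1, d0, bytes, cudaMemcpyDefault);
        // Equivalently: cudaMemcpyPeer(d1, 1, d0, 0, bytes);

        cudaFree(d1);
        cudaSetDevice(0);
        cudaFree(d0);
        return 0;
    }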

    Multi-Threaded Programming :
    This has several important ramifications for multi-threaded processes; some of these are given below. For more detail, refer to CUDA Toolkit 4.0 for Applications.

    • Host threads can now share device memory allocations, streams, events, or any other per-context objects (as seen above).

    • Concurrent kernel execution on devices of compute capability 2.x is now possible across host threads, rather than just within a single host thread. Note that this requires the use of separate streams; unless streams are specified, the kernels will be executed sequentially on the device in the order they were launched. In all cases, kernel launch via the <<<>>> notation is a thread-safe operation. (A short stream sketch is given after this list.)

    • cudaGetLastError() is per-host-thread: it returns the last error returned by an API call in that host thread, even if other host threads are concurrently accessing the same device.
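    A minimal sketch of the separate-streams point noted above is given below (the kernel and sizes are illustrative): the same kernel is issued into two different streams, so on a device of compute capability 2.x the two launches may overlap instead of serializing.

    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *d, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= a;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d0, *d1;
        cudaMalloc((void **)&d0, n * sizeof(float));
        cudaMalloc((void **)&d1, n * sizeof(float));

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        scaleKernel<<<n / 256, 256, 0, s0>>>(d0, 2.0f, n);   // 4th launch parameter: stream
        scaleKernel<<<n / 256, 256, 0, s1>>>(d1, 3.0f, n);

        cudaStreamSynchronize(s0);                           // wait for work in each stream
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        cudaFree(d0);
        cudaFree(d1);
        return 0;
    }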


    CUDA Driver API :

    In CUDA version 4.0, a feature was added that allows multiple host threads to set a particular context current simultaneously, using either cuCtxSetCurrent() or cuCtxPushCurrent(). For more information, refer to CUDA Toolkit 4.0 for Applications. This has several important ramifications for multi-threaded processes:

    • Host threads can now share device memory allocations, streams, events, or any other per-context objects (as seen above).

    • Concurrent kernel execution on devices of compute capability 2.x is now possible across host threads, rather than just within a single host thread. Note that this requires the use of separate streams; unless streams are specified, the kernels will be executed sequentially on the device in the order they were launched.

    CUDA TOOLKIT Libraries

    • The CUBLAS library now supports a new API that is thread-safe and allows the application to more easily take advantage of parallelism using streams, especially for functions with scalar return parameters. This new API allows CUBLAS to work cleanly with applications using the new multi-threading features of CUDA Runtime 4.0. The legacy CUBLAS API is still supported, but it is not thread-safe and does not offer as many opportunities for parallelism with streams as the new API.

    • The CURAND library now supports double precision Sobol, scrambled Sobol, log-normal distributions, and a faster setup technique for XORWOW.

    • The CUFFT and CUBLAS library APIs now include functions that will report the library's version number.

    • The CUSPARSE library now provides a solver for triangular sparse linear systems via the cusparse*csrsv_analysis() and cusparse*csrsv_solve() API functions.

    • The Thrust template library and the NPP image processing library are now bundled with the CUDA Toolkit, with no additional download required.

    • Some API functions in the NPP library were changed to pass results via device pointer instead of via host pointer for consistency with all of the rest of the NPP API.


    NVIDIA - PGI OpenACC Compilation & Execution

    Details of the compilation and execution of an OpenACC program are given below.



    References
    1. NVIDIA Kepler Architecture
    2. NVIDIA CUDA Toolkit 5.0, Preview Release, April 2012
    3. NVIDIA Developer Zone
    4. RDMA for NVIDIA GPUDirect, coming in CUDA 5.0 Preview Release, April 2012
    5. NVIDIA CUDA C Programming Guide, Version 4.2, April 2012 (4/16/2012)
    6. Dynamic Parallelism in CUDA - Tesla K20 Kepler GPUs, Pre-release of NVIDIA CUDA 5.0
    7. NVIDIA Developer ZONE - CUDA Downloads, CUDA Toolkit 4.2
    8. NVIDIA Developer ZONE - GPUDirect
    9. OpenACC - NVIDIA
    10. Nsight, Eclipse Edition, Pre-release of CUDA 5.0, April 2012
    11. NVIDIA OpenCL Programming Guide for the CUDA Architecture, Version 4.0, February 2011 (2/14/2011)
    12. Optimization : NVIDIA OpenCL Best Practices Guide, Version 1.0, February 2011
    13. NVIDIA OpenCL JumpStart Guide - Technical Brief
    14. NVIDIA CUDA C Best Practices Guide (Design Guide), Version 4.0, May 2011
    15. NVIDIA CUDA C Programming Guide, Version 4.0, May 2011 (5/6/2011)
    16. NVIDIA GPU Computing SDK
    17. Apple : Snow Leopard - OpenCL
    18. The OpenCL Specification, Version 1.1, Khronos OpenCL Working Group, Aaftab Munshi (ed.), 2010
    19. The OpenCL Specification, Version 1.0, Khronos OpenCL Working Group
    20. Khronos OpenCL V1.0 Introduction and Overview, June 2010
    21. The OpenCL 1.1 Quick Reference Card
    22. OpenCL 1.2 (pdf file)
    23. OpenCL 1.1 Specification (Revision 44), June 1, 2011
    24. OpenCL Reference Pages
    25. MATLAB
    26. NVIDIA - CUDA MATLAB Acceleration
    27. CUDA by Example : An Introduction to General-Purpose GPU Programming, Jason Sanders, Edward Kandrot (Foreword by Jack Dongarra), Addison-Wesley, 2011
    28. Programming Massively Parallel Processors : A Hands-on Approach, David B. Kirk, Wen-mei W. Hwu, NVIDIA Corporation, Elsevier / Morgan Kaufmann Publishers, 2010
    29. OpenCL Toolbox for MATLAB
    30. NAG
    31. OpenCL Programming Guide, Aaftab Munshi, Benedict R. Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg, Addison-Wesley, Pearson Education, 2012
    32. The OpenCL 1.2 Specification, Khronos OpenCL Working Group
    33. The OpenCL 1.2 Quick Reference Card, Khronos OpenCL Working Group
    Centre for Development of Advanced Computing