



hyPACK-2013 Mode-2 : GPU Comp. CUDA enabled NVIDIA GPU Prog.

NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on the company's powerful GPUs. The CUDA programming model manages threads automatically and differs significantly from single-threaded CPU code and, to some extent, even from other parallel programming models. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks.

In all the programs, the CUDA_SAFE_CALL() that surrounds CUDA API calls is a utility macro that we have provided as part of the hands-on codes. It simply detects that a call has returned an error, prints the associated error message, and exits the application with an EXIT_FAILURE code.



CUDA enabled NVIDIA GPU: A Scalable Parallel Programming Model

CUDA is aimed at providing a solution for many applications, and NVIDIA GPUs that support double-precision floating-point mathematical operations can address an even broader class of applications. CUDA is a parallel programming model and software environment designed to meet this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C. CUDA requires programmers to write special code for parallel processing, but it does not require them to explicitly manage threads, which simplifies the programming model. CUDA includes C/C++ software development tools, function libraries, and a hardware abstraction mechanism that hides the GPU hardware from developers.

A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count. CUDA-compatible GPUs are implemented as a set of multiprocessors. Each multiprocessor has several ALUs (Arithmetic Logic Units) that, at any given clock cycle, execute the same instruction on different data. Each ALU can access (read and write) the multiprocessor's shared memory and the device RAM.

At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. These are exposed to the programmer as a minimal set of extensions to C.

The goal of the CUDA programming interface is to provide a relatively simple path for users familiar with the C programming language to easily write programs for execution by the device. It consists of:


A runtime library split into:
  • A host component that runs on the host and provides functions to control and access one or more compute devices from the host;

  • A device component that runs on the device and provides device-specific functions;

  • A common component that provides built-in vector types and a subset of the C standard library that are supported in both host and device code.


CUDA assumes that the CUDA threads may execute on a physically separate device that operates as a co-processor to the host running the C program. This is the case, for example, when the kernels execute on a GPU and the rest of the C program executes on a CPU. CUDA also assumes that both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime. This includes device memory allocation and deallocation, as well as data transfer between host and device memory.
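As a minimal sketch of this host-side pattern (the names n, h_data, and d_data are illustrative, not part of the hands-on codes), a typical program allocates, transfers, and frees device memory as follows:

  float *h_data, *d_data;
  size_t bytes = n * sizeof(float);

  h_data = (float *) malloc( bytes );                    /* host allocation   */
  cudaMalloc( (void **)&d_data, bytes );                 /* device allocation */

  cudaMemcpy( d_data, h_data, bytes, cudaMemcpyHostToDevice );   /* host to device */
  /* ... launch kernels that operate on d_data ... */
  cudaMemcpy( h_data, d_data, bytes, cudaMemcpyDeviceToHost );   /* device to host */

  cudaFree( d_data );                                    /* release device memory */
  free( h_data );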

A function that gets executed on the grid is called a kernel function. A kernel is executed by a grid, which contains blocks; these blocks in turn contain threads. A thread block is a batch of threads that can cooperate by sharing data through shared memory and by synchronizing their execution. Threads from different blocks operate independently. Because all threads in a grid execute the same kernel function, they rely on unique coordinates to distinguish themselves from each other and to identify the appropriate portion of the data to process.

These threads are organized into a two-level hierarchy using unique coordinates, blockIdx (the block index) and threadIdx (the thread index), assigned to them by the CUDA runtime system. blockIdx and threadIdx appear as built-in, pre-initialized variables that can be accessed within kernel functions. When a thread executes the kernel function, references to the blockIdx and threadIdx variables return the coordinates of the thread. Additional built-in variables, gridDim and blockDim, provide the dimensions of the grid and of each block, respectively.

The CUDA kernel execution configuration defines the dimensions of a grid and its blocks. The unique coordinates in the blockIdx and threadIdx variables allow the threads of a grid to identify themselves and their domains, and kernel functions use these variables so that each thread can properly identify the portion of the data to process in the different levels of memory available in CUDA. Once a grid is launched, its blocks are assigned to streaming multiprocessors in arbitrary order, which gives CUDA applications their scalability. Importantly, the only way for threads in different blocks to synchronize with each other is to terminate the kernel and start a new kernel for the activities after the synchronization point.
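As a minimal sketch (the kernel name scale and its arguments are illustrative), a kernel typically combines these built-in variables to compute a unique global index for each thread:

  __global__ void scale( float *data, float alpha, int n )
  {
        /* global index = offset of the block in the grid + offset of the thread in the block */
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)
              data[i] = alpha * data[i];
  }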

Basic CUDA Definitions :

Host : The CPU and the system's memory are referred to as the host.
Device : The GPU and its memory are referred to as the device.
Kernel : A function that executes on the device is typically called a kernel.


Simple Program Kernel Call :

The program given below includes two important additions to the simple sequential code that make it a “CUDA enabled program”:

  • An empty function named kernel() qualified with __global__
  • A call to the empty function, written in the form kernel<<< 1,1 >>>()

On a Linux system, nvcc hands the function kernel() to the compiler that handles device code and hands main() to the host compiler (GNU gcc). Calling the device code from the host code is the important step here, and it looks much like an ordinary host-function call. The angle brackets denote arguments that are passed to the runtime system; these are not arguments to the device code, but parameters that influence how the runtime will launch the device code. Arguments to the device code itself are passed within the parentheses, just as with any C function.


(Download source code : cuda-hello-world.cu )

The first CUDA parallel program is the “Hello World” program, which simply prints the message “Hello World”. The program calls the function kernel() as explained above.

Example Program : A Kernel Call Program

  #include <stdio.h> 
  #include <cuda.h> 


    __global__ void kernel ( void ) {
    }

  int main ( void ) {

        kernel <<< 1,1 >>>();
        printf ( " Hello World \n " );
        return 0;
  }

Passing Parameters - Allocation of Memory & Pointers

  • Parameters : Parameters are passed to a kernel much as they are to any function in standard C. They need to get from the host to the device at run time, and the runtime system takes care of this.

  • Allocate Memory : To do useful work on a device, such as returning a value to the host, memory must be allocated on the device. Allocation with cudaMalloc() is similar to the standard C call malloc(), but it tells the CUDA runtime to allocate the memory on the device.

  • The first argument is a pointer that receives the address of the newly allocated memory; the second argument is the size of the allocation in bytes.

  • CUDA_SAFE_CALL(), which surrounds these calls, is a utility macro that detects when a call has returned an error, prints the associated error message, and exits the application in a “clean” fashion with an EXIT_FAILURE code.

Many other error-handling checks are required in production code.
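For instance, kernel launches themselves do not return an error code. A common pattern (a sketch, not part of the hands-on macro) is to query the runtime explicitly after a launch:

  kernel <<< 1,1 >>>();

  /* check that the launch itself was accepted by the runtime */
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
        printf( "Kernel launch failed : %s\n", cudaGetErrorString( err ) );

  /* wait for the kernel to finish and catch errors raised during execution */
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess)
        printf( "Kernel execution failed : %s\n", cudaGetErrorString( err ) );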

The following program performs addition of two values, in which data transfers from host to device and from device to host are carried out. Here, parameters are passed to a kernel and memory is allocated on the host and on the device.

This second CUDA parallel program focuses on "passing parameters to a kernel and allocating memory on a device". It performs addition of two values, and the program calls the kernel given in the function __global__ void add. The description of the program is as follows:




(Download source code : cuda-simple-kernel-parameter-program )

Example Program : A Kernel - Passing Parameters & Memory Allocation

#include <stdio.h> 
#include <cuda.h> 

    __global__ void add ( int a, int b, int *c ) {
          *c = a + b;
    }

/* Utility Macro : CUDA SAFE CALL */
void CUDA_SAFE_CALL( cudaError_t call)
{
     cudaError_t ret = call;
     switch(ret)
     {
          case cudaSuccess:
               break;
          default :
          {
               printf(" ERROR at line : %d : error %d : %s\n",
                      __LINE__, ret, cudaGetErrorString(ret));
               exit(-1);
               break;
          }
     }
}

int main ( void ) {
     int c;
     int *dev_c;

     /* Allocate memory on the Device GPU to hold the result */
     CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_c, sizeof (int) ) );

     add <<< 1,1 >>>(2, 3, dev_c);

     /* Copy the result from Device GPU memory back to Host CPU memory */
     CUDA_SAFE_CALL( cudaMemcpy( &c,
                                 dev_c,
                                 sizeof (int),
                                 cudaMemcpyDeviceToHost) );

     printf("2 + 3 = %d \n", c);
     cudaFree( dev_c );
     return 0;
}




Responsibility of the programmer : The programmer should be aware of the restrictions on the usage of device pointers, which are summarized as follows.

  • You can pass pointers allocated with cudaMalloc() to functions that execute on the device.

  • You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device.

  • You can pass pointers allocated with cudaMalloc() to functions that execute on the host.

  • You cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host.

To free memory allocated with cudaMalloc(), use a call to cudaFree().

Device memory can be accessed in two ways: by using device pointers from within device code, and by using calls to cudaMemcpy() from host code.

The last parameter to cudaMemcpy() being cudaMemcpyDeviceToHost instructs the runtime that the source pointer is a device pointer and the destination pointer is a host pointer.

cudaMemcpyHostToDevice instructs the runtime that the source data is on the host and the destination is an address on the device.

Also, one can specify cudaMemcpyDeviceToDevice, which indicates that both pointers are on the device.
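A minimal sketch of the three transfer directions (the pointer names h_in, h_out, d_in, d_tmp, and d_out are illustrative):

  /* host to device : initialize device input */
  cudaMemcpy( d_in,  h_in,  bytes, cudaMemcpyHostToDevice );

  /* device to device : both pointers refer to device memory */
  cudaMemcpy( d_tmp, d_in,  bytes, cudaMemcpyDeviceToDevice );

  /* device to host : bring the result back */
  cudaMemcpy( h_out, d_out, bytes, cudaMemcpyDeviceToHost );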


Querying Devices

The CUDA runtime provides an easy interface for determining which devices (if any) are present and what capabilities each device supports. First, to get a count of how many devices in the system are built on the CUDA architecture, call cudaGetDeviceCount(). After calling cudaGetDeviceCount(), iterate through the devices and query the relevant information about each one. The CUDA runtime returns the device properties in a structure of type cudaDeviceProp. As of CUDA 4.1 and CUDA 5.0, the cudaDeviceProp structure contains the necessary information, and most of its fields are self-explanatory and commonly used CUDA device properties. The third example program, given below, is focused on device properties:


#include <stdio.h> 
#include <time.h> 
#include <cuda.h> 

#define KB 1024       /* To indicate results in KiloBytes */

/* Utility Macro : CUDA SAFE CALL */
void CUDA_SAFE_CALL( cudaError_t call)
{
     cudaError_t ret = call;
     switch(ret)
     {
          case cudaSuccess:
               break;
          default :
          {
               printf(" ERROR at line : %d : error %d : %s\n",
                      __LINE__, ret, cudaGetErrorString(ret));
               exit(-1);
               break;
          }
     }
}

int main ( void ) {
int count;
cudaDeviceProp prop;

CUDA_SAFE_CALL(cudaGetDeviceCount( &count) );

for(int i = 0; i < count; i++) {
CUDA_SAFE_CALL( cudaGetDeviceProperties( &prop, i) );

printf("Information about the device \t: %d\n", count);

printf("Name \t\t\t\t: %s\n",prop.name);

printf("Compute capability \t\t: %d.%d\n",
        prop.major, prop.minor);

printf("Clock rate \t\t\t: %d\n", prop.clockRate);

printf("Device overlap \t\t\t: ");


if (prop.deviceOverlap)
        printf("ENABLED \n");

else
        printf("DISABLED\n");

printf("Kernel execution timeout \t: ");


if (prop.kernelExecTimeoutEnabled)
        printf("ENABLED\n");

else
        printf("DISABLED\n");

printf("Total global memory \t\t: %ld MB\n",
       (prop.totalGlobalMem/KB)/KB);

printf("Total constant memory \t\t: %ld\n",
        prop.totalConstMem);

printf("Maximum memory pitch \t\t: %ld\n", prop.memPitch);

printf("Texture alignment \t\t: %ld\n",
        prop.textureAlignment);

printf("Multiprocessor count \t\t: %d\n",
        prop.multiProcessorCount);

printf("Shared memory per MP \t\t: %ld KB\n",
        prop.sharedMemPerBlock/KB);

printf("Registers per MP \t\t: %ld\n",
        prop.regsPerBlock);

printf("Threads in warp \t\t: %d\n", prop.warpSize);

printf("Maximum threads per dimension \t: %d\n",
        prop.maxThreadsPerBlock);

printf("Maximum thread dimension \t: (%d, %d, %d)\n",
        prop.maxThreadsDim[0], prop.maxThreadsDim[1],
        prop.maxThreadsDim[2]);

printf("\n\n\n");

}
return 0;
}





CUDA Device Structure

  struct cudaDeviceProp {
        char name[256];
        size_t totalGlobalMem;
        size_t sharedMemPerBlock;
        int regsPerBlock;
        int warpSize;
        size_t memPitch;
        int maxThreadsPerBlock;
        int maxThreadsDim[3];
        int maxGridSize[3];
        size_t totalConstMem;
        int major;
        int minor;
        int clockRate;
        size_t textureAlignment;
        int deviceOverlap;
        int multiProcessorCount;
        int kernelExecTimeoutEnabled;
        int integrated;
        int canMapHostMemory;
        int computeMode;
        int maxTexture1D;
        int maxTexture2D[2];
        int maxTexture3D[3];
        int maxTexture2DArray[3];
        int concurrentKernels;
  };



CUDA Device Properties (Refer NVIDIA CUDA Programming Guide)

Device Property Description
char name[256] An ASCII string identifying the device (e.g., "GeForce GTX 280")
size_t totalGlobalMem The amount of global memory on the device in bytes
size_t sharedMemPerBlock The maximum amount of shared memory a single block may use in bytes
int regsPerBlock The number of 32-bit registers available per block
int warpSize The number of threads in a warp
size_t memPitch The maximum pitch allowed for memory copies in bytes
int maxThreadsPerBlock The maximum number of threads that a block may contain
int maxThreadsDim[3] The maximum number of threads allowed along each dimension of a block
int maxGridSize[3] The number of blocks allowed along each dimension of a grid
size_t totalConstMem The amount of available constant memory
int major The major revision of the device's compute capability
int minor The minor revision of the device's compute capability
size_t textureAlignment The device's requirement for texture alignment
int deviceOverlap A boolean value representing whether the device can simultaneously perform a cudaMemcpy() and kernel execution
int multiProcessorCount The number of multiprocessors on the device
int kernelExecTimeoutEnabled A boolean value representing whether there is a runtime limit for kernels executed on this device
int integrated A boolean value representing whether the device is an integrated GPU (i.e., part of the chipset and not a discrete GPU)
int canMapHostMemory A boolean value representing whether the device can map host memory into the CUDA device address space
int computeMode A value representing the device's computing mode: default, exclusive, or prohibited
int maxTexture1D The maximum size supported for 1D textures
int maxTexture2D[2] The maximum dimensions supported for 2D textures
int maxTexture3D[3] The maximum dimensions supported for 3D textures
int maxTexture2DArray[3] The maximum dimensions supported for 2D texture arrays
int concurrentKernels A boolean value representing whether the device supports executing multiple kernels within the same context simultaneously


Using Device Properties : Program


A query with cudaGetDeviceProperties() is useful for applications in which a kernel needs close interaction with the CPU, and for applications that may be executed on an integrated GPU that shares system memory with the CPU.

For example, if an application depends on having double-precision floating-point support, it needs to check that the card has compute capability 1.3 or higher, since only such devices support double-precision floating-point calculations. To run the application, we need to find at least one device of compute capability 1.3 or higher.

First, we fill a cudaDeviceProp structure with the properties the device must have and pass it to cudaChooseDevice() to have the CUDA runtime find a device that satisfies the given constraint. The call to cudaChooseDevice() returns a device ID that we can then pass to cudaSetDevice(). The description of the program is as follows:




How to find a Device

(Download source code : cuda-find-device )

#include <stdio.h> 
#include <string.h>   /* for memset() */
#include <time.h> 
#include <cuda.h> 

/* Utility Macro : CUDA SAFE CALL */
void CUDA_SAFE_CALL( cudaError_t call)
{
     cudaError_t ret = call;
     switch(ret)
     {
          case cudaSuccess:
               break;
          default :
          {
               printf(" ERROR at line : %d : error %d : %s\n",
                      __LINE__, ret, cudaGetErrorString(ret));
               exit(-1);
               break;
          }
     }
}

int main ( void ) {
     int count;
     int dev;
     cudaDeviceProp prop;

     CUDA_SAFE_CALL( cudaGetDeviceCount( &count) );

     /* Print the name of every CUDA device present in the system */
     for(int i = 0; i < count; i++) {
          CUDA_SAFE_CALL( cudaGetDeviceProperties( &prop, i) );

          printf("Information about the device \t: %d\n", i);
          printf("Name \t\t\t\t: %s\n", prop.name);
     }

     /* ID of the device currently in use */
     CUDA_SAFE_CALL( cudaGetDevice(&dev) );
     printf("ID of the current CUDA device \t: %d\n", dev);

     /* Ask the runtime for a device of compute capability 1.3 or higher */
     memset(&prop, 0, sizeof (cudaDeviceProp));
     prop.major = 1;
     prop.minor = 3;

     CUDA_SAFE_CALL( cudaChooseDevice(&dev, &prop ) );

     printf("ID of CUDA device closest to revision 1.3 : %d \n", dev);

     CUDA_SAFE_CALL( cudaSetDevice(dev) );

     return 0;
}







CUDA Program for Vector Vector Addition

The input vectors are generated on the host-CPU and transferred to the device-GPU for vector-vector addition. A simple kernel based on a grid of thread blocks is used, in which each thread is given a unique thread ID within its block. Each thread performs a partial addition of the two vectors, and the final result is generated on the device-GPU and transferred back to the host-CPU. The important steps are given below.

Steps Description
1. Memory allocation on host-CPU and device-GPU :
   Allocate memory for the two input vectors and the resultant vector on the host-CPU and device-GPU.
   Use cudaMalloc(void** array, int size)
2. Input data generation :
   Fill the input vectors with single/double precision real values using randomized data as per the input specification.
3. Transfer data from host-CPU to device-GPU :
   Transfer the host-CPU vectors to the device-GPU to perform the computation.
   Use cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
4. Launch kernel :
   Define the dimensions of the grid and block on the host-CPU and launch the kernel for execution on the device-GPU.
   The computation for vector-vector addition is performed on the device.
5. Transfer the result from device-GPU to host-CPU :
   Copy the resultant vector from the device-GPU to the host-CPU.
   Use cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
6. Check correctness of the result on host-CPU :
   Compute the vector-vector addition on the host-CPU and compare the CPU and GPU results.
7. Free the memory :
   Free the memory of the arrays allocated on the host-CPU and device-GPU.
   Use cudaFree(void* array)



(Download source code : cuda-vector-vector-addition-blocks )



Example Program : Vector Vector Addition

#include <stdio.h> 
#include <cuda.h> 

#define N 100


  __global__ void add ( int *a, int *b, int *c ) {

        int tid = blockIdx.x;   // handle the data at this block index

        // CUDA C allows you to define a group of blocks in two dimensions

        if ( tid < N)
              c[tid] = a[tid] + b[tid];
  }

  /* Utility Macro : CUDA SAFE CALL */
  void CUDA_SAFE_CALL( cudaError_t call)
  {
       cudaError_t ret = call;
       switch(ret)
       {
            case cudaSuccess:
                 break;
            default :
            {
                 printf(" ERROR at line : %d : error %d : %s\n",
                        __LINE__, ret, cudaGetErrorString(ret));
                 exit(-1);
                 break;
            }
       }
  }

  int main ( void ) {
int a[N],b[N], c[N];
int *dev_a, *dev_b, *dev_c;

// Allocate memory on the Device GPU
CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_a, N * sizeof (int) ) );

CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_b, N * sizeof (int) ) );

CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_c, N * sizeof (int) ) );

// Fill the Arrays 'a' and 'b' on the Host CPU
for(int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i+1;
}

// Copy the Arrays 'a' and 'b' to the Device GPU
CUDA_SAFE_CALL(cudaMemcpy(
                      dev_a,
                      a,
                      N * sizeof (int),
                      cudaMemcpyHostToDevice) );

CUDA_SAFE_CALL(cudaMemcpy(
                      dev_b,
                      b,
                      N * sizeof (int),
                      cudaMemcpyHostToDevice) );

add <<< N,1 >>>(dev_a, dev_b, dev_c);

// Copy the result Array 'c' back from the Device GPU to the Host CPU
CUDA_SAFE_CALL(cudaMemcpy(
                      c,
                      dev_c,
                      N * sizeof (int),
                      cudaMemcpyDeviceToHost) );

// Display the results on Host CPU
for(int i = 0; i < N; i++) {
          printf("%d + %d = %d \n", a[i], b[i], c[i] );
}

// Free the memory allocated on the Device GPU
cudaFree( dev_a);
cudaFree( dev_b);
cudaFree( dev_c);

return 0;
  }





Vector Vector Addition (Thread Cooperation-Splitting Blocks)

In the earlier example, a simple kernel add based on a grid of thread blocks is used, in which each thread is given a unique thread ID within its block. Each thread performs a partial addition of the two vectors, and the final result is generated on the device-GPU and transferred back to the host-CPU.


// N stands for number of Blocks
// 1 stands for number of threads per block


kernel <<< N,1 >>>();





The CUDA runtime allows these blocks to be split into threads. In the earlier example, N blocks are running on the GPU. To identify which block is running, the variable blockIdx.x is used in the kernel code.

  __global__ void add ( int *a, int *b, int *c ) {

        int tid = blockIdx.x;   // handle the data at this block index

        // CUDA C : Allows a group of blocks to be defined in two dimensions

        if ( tid < N)
              c[tid] = a[tid] + b[tid];
  }





Here, the variable blockIdx.x is a built-in variable defined by the CUDA runtime. CUDA C allows a group of blocks to be defined in two dimensions, and the CUDA runtime allows these blocks to be split into threads.

In the above, the first argument in the angle brackets, N, represents the number of blocks to be launched, and the second parameter represents the number of threads per block. The CUDA runtime therefore creates “N parallel threads” in the above example, as given below.

  N blocks X 1 thread/block = N Parallel threads
In the above example, a launch of N blocks of one thread each is performed at run time.


kernel <<< 1,N >>>();

In the above, a launch of N threads, all within one block, is performed.

In the earlier example program, the input and output data are indexed by the block index, i.e.,

int tid = blockIdx.x;


With a single block of many threads, the data is instead indexed by the thread index:

int tid = threadIdx.x;

With the above, we can rewrite the code to move from a parallel block implementation to a parallel thread implementation. The source code listing is given in the following few lines.

  __global__ void add ( int *a, int *b, int *c ) {

        int tid = threadIdx.x;   // handle the data at this thread index

        // CUDA C : Allows a group of blocks to be defined in two dimensions

        if ( tid < N)
              c[tid] = a[tid] + b[tid];
  }

The maximum number of blocks in a single launch is 65,535, and the hardware also limits the number of threads per block with which we can launch a kernel. The maxThreadsPerBlock field of the device properties structure specifies this maximum, and a launch cannot exceed it. For many GPUs, the limit is 512 threads per block.

To incorporate multiple blocks and threads, the indexing will start to look similar to the standard method for converting from a two-dimensional index space to a linear space.

We use the new built-in variable blockDim. This variable is constant for all blocks and stores the number of threads along each dimension of the block. Since the present example uses a one-dimensional block, we refer only to blockDim.x. The number of blocks along each dimension of the entire grid is stored in gridDim. It is important to note that gridDim is two-dimensional, whereas blockDim is actually three-dimensional.

That is, the CUDA runtime allows you to launch a two-dimensional grid of blocks, where each block is a three-dimensional array of threads.



  __global__ void add ( int *a, int *b, int *c ) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < N)
              c[tid] = a[tid] + b[tid];
  }
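On the host side, a common launch idiom (a sketch; the value of threadsPerBlock is illustrative and must not exceed maxThreadsPerBlock) pairs this kernel with just enough blocks to cover all N elements:

  int threadsPerBlock = 128;
  int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock;   /* ceiling division */

  add <<< blocksPerGrid, threadsPerBlock >>>( dev_a, dev_b, dev_c );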





Vector Vector Addition (Dimension of Grid & Each Block)

In CUDA, all the threads in a grid execute the same kernel function. Also, each thread has unique coordinates to distinguish itself from the others and to identify the appropriate portion of the data to access. These threads are organized into a two-level hierarchy using unique coordinates, blockIdx (the block index) and threadIdx (the thread index), assigned to them by the CUDA runtime system; these variables can be accessed within kernel functions. When a thread executes the kernel function, references to the blockIdx and threadIdx variables return the coordinates of the thread. Additional built-in variables, gridDim and blockDim, provide the dimensions of the grid and of each block, respectively.

In the CUDA thread organization, a grid consists of N thread blocks, and each block, in turn, consists of M threads; each grid therefore has a total of N*M threads. All blocks at the grid level are organized as one- or two-dimensional (1D or 2D) arrays; all threads within each block are organized as one-, two-, or three-dimensional (1D, 2D, or 3D) arrays.

In general, a grid is organized as a 2D array of blocks, and each block is organized as a 3D array of threads. The exact organization of a grid is determined by the execution configuration provided at kernel launch. The first parameter of the execution configuration specifies the dimensions of the grid in terms of the number of blocks; the second specifies the dimensions of each block in terms of the number of threads. Each such parameter is of type dim3, which is essentially a C struct with three unsigned integer fields: x, y, and z. Because grids are 2D arrays of blocks, the third field of the grid dimension parameter is ignored; it should be set to 1 for clarity. The following host code can be used to launch the kernel, and the details are explained below.


dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
KernelFunction<<< dimGrid, dimBlock >>>(...);

The first two statements initialize the execution configuration parameters. Because the grid and the blocks are 1D arrays in this case, only the first dimension of dimBlock and dimGrid is used; the other dimensions are set to 1. The third statement is the actual kernel launch, with the execution configuration parameters placed between <<< and >>>.

The values of gridDim.x and gridDim.y can be calculated from other variables at kernel launch time. Once a kernel is launched, its dimensions cannot change. All threads in a block share the same blockIdx value. The blockIdx.x value ranges between 0 and gridDim.x-1, and the blockIdx.y value between 0 and gridDim.y-1.

In general, blocks are organized into 3D arrays of threads, and all blocks in a grid have the same dimensions. Each threadIdx consists of three components: the x coordinate threadIdx.x, the y coordinate threadIdx.y, and the z coordinate threadIdx.z. The number of threads in each dimension of a block is specified by the second execution configuration parameter given at the kernel launch. Within the kernel, this configuration parameter can be accessed as the predefined struct variable blockDim. The total size of a block is limited to 512 threads, with flexibility in distributing these elements over the three dimensions as long as the total number of threads does not exceed 512. For example, (512, 1, 1), (8, 16, 2), and (16, 16, 2) are all allowable blockDim values.
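As a minimal sketch of such a 2D configuration (the kernel name process and the names d_data, width, and height are illustrative), a kernel that works on a width x height array might compute its coordinates and be launched as follows:

  __global__ void process( float *data, int width, int height )
  {
        /* 2D coordinates of this thread within the whole grid */
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;

        if (x < width && y < height)
              data[y * width + x] *= 2.0f;    /* linearized 2D index */
  }

  /* host side : 16 x 16 = 256 threads per block, enough blocks to cover the array */
  dim3 dimBlock( 16, 16, 1 );
  dim3 dimGrid( (width  + dimBlock.x - 1) / dimBlock.x,
                (height + dimBlock.y - 1) / dimBlock.y, 1 );
  process<<< dimGrid, dimBlock >>>( d_data, width, height );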


#include <stdio.h> 
#include <stdlib.h>   /* for malloc(), drand48(), exit() */
#include <cuda.h> 

#define EPS 1.0e-12
#define GRIDSIZE 10
#define BLOCKSIZE 16

#define SIZE 128



double *dMatA, *dMatB, *dresult;

double *hMatA, *hMatB, *hMatC, *hresult;

double *CPU_Result;

int vlength = SIZE, count = 0;

int blockWidth;

cudaEvent_t start, stop;
cudaDeviceProp deviceProp;
int device_Count, size = SIZE;

  __global__ void vectvectadd( double *dm1, double *dm2, double *dres, int num ) {

        int tx = blockIdx.x*blockDim.x + threadIdx.x;
        int ty = blockIdx.y*blockDim.y + threadIdx.y;
        int tindex = tx + (gridDim.x)*(blockDim.x)*ty;

        if(tindex < num)
              dres[tindex] = dm1[tindex] + dm2[tindex];

  }

  /* Check for safe return of all calls to the device */
  /* Utility Macro : CUDA SAFE CALL */
  void CUDA_SAFE_CALL( cudaError_t call)
  {
       cudaError_t ret = call;
       switch(ret)
       {
            case cudaSuccess:
                 break;
            default :
            {
                 printf(" ERROR at line : %d : error %d : %s\n",
                        __LINE__, ret, cudaGetErrorString(ret));
                 exit(-1);
                 break;
            }
       }
  }

  /* Fill the vector with double precision values */
  void fill_dp_vector( double* vec, int size )
  {
        /* fill the vector on the host-CPU with random values */
        for(int i = 0; i < size; i++)
              vec[i] = drand48();
  }

  /* Terminate and exit if the requested block or grid dimensions
      exceed the limits of the device */

  void check_block_grid_dim(
                                    cudaDeviceProp devProp,
                                    dim3 blockDim,
                                    dim3 gridDim)
  {
        if (blockDim.x >= devProp.maxThreadsDim[0] ||
            blockDim.y >= devProp.maxThreadsDim[1] ||
            blockDim.z >= devProp.maxThreadsDim[2] )
        {
              exit(-1);
        }

        if (gridDim.x >= devProp.maxGridSize[0] ||
            gridDim.y >= devProp.maxGridSize[1] ||
            gridDim.z >= devProp.maxGridSize[2] )
        {
              exit(-1);
        }
  }


  /* Memory Allocation Errors */
  void mem_error( char *arrayname, char *benchmark, int len, char *type )
  {
        printf("\n Memory not sufficient to allocate for array %s\n\t"
               " Benchmark : %s \n\t"
               " Memory requested = %d number of %s elements\n",
               arrayname, benchmark, len, type);

        printf("\n\t Aborting \n\n");
        exit(-1);
  }

  /* Number of CUDA devices present in the system */
  int get_DeviceCount()
  {
        int count;
        cudaGetDeviceCount(&count);
        return count;
  }

  /* Device Query Information */
  void deviceQuery()
  {
        int device;
        device_Count = get_DeviceCount();

        cudaSetDevice(0);
        cudaGetDevice(&device);
        cudaGetDeviceProperties(&deviceProp, device);
  }

 /* Launch Kernel */
  void launch_kernel()
  {
        dim3 dimBlock(BLOCKSIZE, BLOCKSIZE);

        dim3 dimGrid((vlength/(BLOCKSIZE*BLOCKSIZE))+1, 1);

        check_block_grid_dim (deviceProp, dimBlock, dimGrid);

        vectvectadd <<< dimGrid, dimBlock >>>
        (dMatA, dMatB, dresult, vlength);
  }

 /* Device Memory free */
  void dfree (double* arr[ ], int len)
  {
        /* Free memory on the Device GPU */
        for(int i = 0; i < len; i++)
              CUDA_SAFE_CALL( cudaFree(arr[i]) );
        printf("memory freed \n");
  }

  /* Vector Vector Addition on the Host CPU */
  void vectvect_add_in_cpu(double *A, double *B, double *C, int size)
  {
        for(int i = 0; i < size; i++)
              C[i] = A[i] + B[i];
  }

  /* print_Gflops_rating : returns the Gflops rating of the kernel */
  double print_Gflops_rating(float Tsec, int size)
  {
        double gflops;
        gflops = (1.0e-9 * ( (1.0 * size ) / Tsec) );
        return gflops;
  }

  /* print_on_screen */
  void print_on_screen(char *program_name, float tsec, double gflops,
          int size, int flag)    // flag = 1 if Gflops calculation else flag = 0
  {
        printf("\n ........ %s ........ \n", program_name);
        printf("\t SIZE \t TIME_SEC \t Gflops\n");
        if(flag == 1)
              printf("\t%d \t%f \t%lf \t", size, tsec, gflops);
        else
              printf("\t%d \t%s \t%s \t", size, "---", "---");
  }

  int main ( void ) {
        double *array[3];
        float elapsedTime, Tsec;

        deviceQuery();

        // CUDA Events : Time Calculation
        CUDA_SAFE_CALL( cudaEventCreate(&start));
        CUDA_SAFE_CALL( cudaEventCreate(&stop));

        // Allocate host memory
        hMatA = (double*) malloc( vlength * sizeof(double));
        if(hMatA == NULL)
              mem_error("hMatA","vectvectadd",vlength,"double");

        hMatB = (double*) malloc( vlength * sizeof(double));
        if(hMatB == NULL)
              mem_error("hMatB","vectvectadd",vlength,"double");

        hresult = (double*) malloc( vlength * sizeof(double));
        if(hresult == NULL)
              mem_error("hresult","vectvectadd",vlength,"double");

        // Allocate device memory
        CUDA_SAFE_CALL( cudaMalloc( (void**)&dMatA, vlength * sizeof (double) ) );
        CUDA_SAFE_CALL( cudaMalloc( (void**)&dMatB, vlength * sizeof (double) ) );
        CUDA_SAFE_CALL( cudaMalloc( (void**)&dresult, vlength * sizeof (double) ) );

        array[0] = dMatA;
        array[1] = dMatB;
        array[2] = dresult;

        // Fill the data in the host vectors
        fill_dp_vector(hMatA, vlength);
        fill_dp_vector(hMatB, vlength);

        // Copy the input vectors from the Host CPU to the Device GPU
        CUDA_SAFE_CALL(cudaMemcpy(
                      (void*)dMatA,
                      (void*)hMatA,
                      vlength * sizeof (double),
                      cudaMemcpyHostToDevice) );

        CUDA_SAFE_CALL(cudaMemcpy(
                      (void*)dMatB,
                      (void*)hMatB,
                      vlength * sizeof (double),
                      cudaMemcpyHostToDevice) );

        // CUDA Events : record start time, launch the kernel, record stop time
        CUDA_SAFE_CALL( cudaEventRecord (start, 0));

        launch_kernel();

        CUDA_SAFE_CALL( cudaEventRecord (stop, 0));
        CUDA_SAFE_CALL( cudaEventSynchronize (stop));
        CUDA_SAFE_CALL( cudaEventElapsedTime (&elapsedTime, start, stop));
        Tsec = elapsedTime / 1000.0f;      // milliseconds to seconds

        // Copy the result vector from the Device GPU to the Host CPU
        CUDA_SAFE_CALL(cudaMemcpy(
                      (void*)hresult,
                      (void*)dresult,
                      vlength * sizeof (double),
                      cudaMemcpyDeviceToHost) );

        printf("\n --------------------------------------------------");

        // Calculation of Gflops & printing
        print_on_screen("Vector Vector Addition", Tsec,
                        print_Gflops_rating(Tsec, vlength), size, 1);

        // Free the memory allocated on the Device GPU
        dfree(array, 3);

        // Free the memory allocated on the Host CPU
        free(hMatA);
        free(hMatB);
        free(hresult);

        return 0;
  }




(Download source code : cuda_memcheck_nvml.c )

CUDA Compilation, Linking and Execution of Program

For compilation of a CUDA program, additional steps are involved, partly because the program targets two different processor architectures (the GPU and a host CPU), and partly because of CUDA's hardware abstraction. Compiling a CUDA program is not as straightforward as running a C compiler to convert source code into executable object code. The same source file mixes C/C++ code written for both the GPU and the CPU, and special extensions and declarations identify the GPU code. The first step is to separate the source code for each target architecture.

nvcc is a compiler driver that simplifies the process of compiling CUDA code: it provides simple and familiar command-line options and executes them by invoking the collection of tools that implement the different compilation stages. nvcc's basic workflow consists of separating device code from host code and compiling the device code into a binary form, or cubin object. The generated host code is output either as C code that is left to be compiled using another tool, or directly as object code by invoking the host compiler during the last compilation stage.



Figure 1. CUDA : Source Code Compilation Stages.



CUDA code should include the cuda.h header file. On the compilation command line, the CUDA library should be specified to the linker in UNIX and Linux environments. The two steps are explained below.

Using command line arguments to compile CUDA source code:

Compiling and executing a CUDA program is as simple as compiling C language source code.

$ nvcc -o < executable name > < name of source file >

For example, to compile the simple Hello World program, the user can give :

$ nvcc -o helloworld cuda-helloworld.cu
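When the host code lives in separate .c or .cpp files, a two-step compile-and-link sketch along the following lines can be used (the file names are illustrative, and the CUDA library path may differ on your installation):

$ nvcc -c cuda-helloworld.cu -o helloworld_kernel.o
$ gcc -c main.c -o main.o
$ g++ helloworld_kernel.o main.o -o helloworld -L/usr/local/cuda/lib64 -lcudart

Here nvcc compiles only the device-side file into an object, and the host compiler performs the final link against the CUDA runtime library (libcudart), as mentioned above.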

Executing a Program:

To execute a CUDA program, give the name of the executable at the command prompt.

$ . / < Name of the Executable >

For example, to execute the simple Hello World program, the user must type:

$ ./helloworld

The output should look similar to the following:

Hello World!


Centre for Development of Advanced Computing