



hyPACK-2013 Mode-2 : GPU Comp. CUDA enabled NVIDIA GPU Prog.

NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on the company's powerful GPUs. The CUDA programming model manages threads automatically and differs significantly from single-threaded CPU code and, to some extent, even from other parallel programming models. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks.

In all the programs, the CUDA_SAFE_CALL() that surrounds CUDA API calls is a utility macro that we have provided as part of the hands-on codes. It simply detects that a call has returned an error, prints the associated error message, and exits the application with an EXIT_FAILURE code.



CUDA enabled NVIDIA GPU: A Scalable Parallel Programming Model

CUDA is aimed at providing a solution for many applications, and NVIDIA GPUs that support double-precision floating-point mathematical operations can address an even broader class of applications. CUDA is a parallel programming model and software environment designed to meet this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C. CUDA requires programmers to write special code for parallel processing, but it does not require them to explicitly manage threads, which simplifies the programming model. CUDA includes C/C++ software development tools, function libraries, and a hardware abstraction mechanism that hides the GPU hardware from developers.

A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count. CUDA-compatible GPUs are implemented as a set of multiprocessors. Each multiprocessor has several ALUs (Arithmetic Logic Units) that, at any given clock cycle, execute the same instruction on different data. Each ALU can access (read and write) the multiprocessor's shared memory and the device RAM.

At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. These are exposed to the programmer as a minimal set of extensions to C.

The goal of the CUDA programming interface is to provide a relatively simple path for users familiar with the C programming language to easily write programs for execution by the device. It consists of:


A runtime library split into:
  • A host component that runs on the host and provides functions to control and access one or more compute devices from the host;

  • A device component that runs on the device and provides device-specific functions;

  • A common component that provides built-in vector types and a subset of the C standard library that are supported in both host and device code.


CUDA assumes that the CUDA threads may execute on a physically separate device that operates as a co-processor to the host running the C program. This is the case, for example, when the kernels execute on a GPU and the rest of the C program executes on a CPU. CUDA also assumes that both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime. This includes device memory allocation and deallocation, as well as data transfer between host and device memory.
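As a minimal sketch of this host-side pattern (the names n, h_data, and d_data are illustrative, not part of the hands-on codes), a typical program allocates, transfers, and frees device memory as follows:

  float *h_data, *d_data;
  size_t bytes = n * sizeof(float);

  h_data = (float *) malloc( bytes );                    /* host allocation   */
  cudaMalloc( (void **)&d_data, bytes );                 /* device allocation */

  cudaMemcpy( d_data, h_data, bytes, cudaMemcpyHostToDevice );   /* host to device */
  /* ... launch kernels that operate on d_data ... */
  cudaMemcpy( h_data, d_data, bytes, cudaMemcpyDeviceToHost );   /* device to host */

  cudaFree( d_data );                                    /* release device memory */
  free( h_data );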

A function that gets executed on the grid is called a kernel function. A kernel is executed by a grid, which contains blocks; these blocks in turn contain threads. A thread block is a batch of threads that can cooperate by sharing data through shared memory and by synchronizing their execution. Threads from different blocks operate independently. Because all threads in a grid execute the same kernel function, they rely on unique coordinates to distinguish themselves from each other and to identify the appropriate portion of the data to process.

These threads are organized into a two-level hierarchy using unique coordinates, blockIdx (the block index) and threadIdx (the thread index), assigned to them by the CUDA runtime system. blockIdx and threadIdx appear as built-in, pre-initialized variables that can be accessed within kernel functions. When a thread executes the kernel function, references to the blockIdx and threadIdx variables return the coordinates of the thread. Additional built-in variables, gridDim and blockDim, provide the dimensions of the grid and of each block, respectively.

The CUDA kernel execution configuration defines the dimensions of a grid and its blocks. The unique coordinates in the blockIdx and threadIdx variables allow the threads of a grid to identify themselves and their domains, and kernel functions use these variables so that each thread can properly identify the portion of the data to process in the different levels of memory available in CUDA. Once a grid is launched, its blocks are assigned to streaming multiprocessors in arbitrary order, which gives CUDA applications their scalability. Importantly, the only way for threads in different blocks to synchronize with each other is to terminate the kernel and start a new kernel for the activities after the synchronization point.
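As a minimal sketch (the kernel name scale and its arguments are illustrative), a kernel typically combines these built-in variables to compute a unique global index for each thread:

  __global__ void scale( float *data, float alpha, int n )
  {
        /* global index = offset of the block in the grid + offset of the thread in the block */
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)
              data[i] = alpha * data[i];
  }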

Basic CUDA Definitions :

Host : The CPU and the system's memory are referred to as the host.
Device : The GPU and its memory are referred to as the device.
Kernel : A function that executes on the device is typically called a kernel.


Simple Program Kernel Call :

The program given below includes two important additions to the simple sequential code that make it a “CUDA enabled program”:

  • An empty function named kernel() qualified with __global__
  • A call to the empty function, written in the form kernel<<< 1,1 >>>()

On a Linux system, nvcc hands the function kernel() to the compiler that handles device code and hands main() to the host compiler (GNU gcc). Calling the device code from the host code is the important step here, and it looks much like an ordinary host-function call. The angle brackets denote arguments that are passed to the runtime system; these are not arguments to the device code, but parameters that influence how the runtime will launch the device code. Arguments to the device code itself are passed within the parentheses, just as with any C function.


(Download source code : cuda-hello-world.cu )

The first CUDA parallel program is the “Hello World” program, which simply prints the message “Hello World”. The program calls the function kernel() as explained above.

Example Program : A Kernel Call Program

  #include <stdio.h> 
  #include <cuda.h> 


    __global__ void kernel ( void ) {
    }

  int main ( void ) {

        kernel <<< 1,1 >>>();
        printf ( " Hello World \n " );
        return 0;
  }

Passing Parameters - Allocation of Memory & Pointers

  • Parameters : Parameters are passed to a kernel much as they are to any function in standard C. They need to get from the host to the device at run time, and the runtime system takes care of this.

  • Allocate Memory : To do useful work on a device, such as returning a value to the host, memory must be allocated on the device. Allocation with cudaMalloc() is similar to the standard C call malloc(), but it tells the CUDA runtime to allocate the memory on the device.

  • The first argument is a pointer that receives the address of the newly allocated memory; the second argument is the size of the allocation in bytes.

  • CUDA_SAFE_CALL(), which surrounds these calls, is a utility macro that detects when a call has returned an error, prints the associated error message, and exits the application in a “clean” fashion with an EXIT_FAILURE code.

Many other error-handling checks are required in production code.
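For instance, kernel launches themselves do not return an error code. A common pattern (a sketch, not part of the hands-on macro) is to query the runtime explicitly after a launch:

  kernel <<< 1,1 >>>();

  /* check that the launch itself was accepted by the runtime */
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
        printf( "Kernel launch failed : %s\n", cudaGetErrorString( err ) );

  /* wait for the kernel to finish and catch errors raised during execution */
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess)
        printf( "Kernel execution failed : %s\n", cudaGetErrorString( err ) );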

The following program performs addition of two values, in which data transfers from host to device and from device to host are carried out. Here, parameters are passed to a kernel and memory is allocated on the host and on the device.

This second CUDA parallel program focuses on "passing parameters to a kernel and allocating memory on a device". It performs addition of two values, and the program calls the kernel given in the function __global__ void add. The description of the program is as follows:




(Download source code : cuda-simple-kernel-parameter-program )

Example Program : A Kernel - Passing Parameters & Memory Allocation

#include <stdio.h> 
#include <cuda.h> 

    __global__ void add ( int a, int b, int *c ) {
          *c = a + b;
    }

/* Utility Macro : CUDA SAFE CALL */
void CUDA_SAFE_CALL( cudaError_t call)
{
     cudaError_t ret = call;
     switch(ret)
     {
          case cudaSuccess:
               break;
          default :
          {
               printf(" ERROR at line : %d : error %d : %s\n",
                      __LINE__, ret, cudaGetErrorString(ret));
               exit(-1);
               break;
          }
     }
}

int main ( void ) {
     int c;
     int *dev_c;

     /* Allocate memory on the Device GPU to hold the result */
     CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_c, sizeof (int) ) );

     add <<< 1,1 >>>(2, 3, dev_c);

     /* Copy the result from Device GPU memory back to Host CPU memory */
     CUDA_SAFE_CALL( cudaMemcpy( &c,
                                 dev_c,
                                 sizeof (int),
                                 cudaMemcpyDeviceToHost) );

     printf("2 + 3 = %d \n", c);
     cudaFree( dev_c );
     return 0;
}




Responsibility of the programmer : The programmer should be aware of the restrictions on the usage of device pointers, which are summarized as follows.

  • You can pass pointers allocated with cudaMalloc() to functions that execute on the device.

  • You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device.

  • You can pass pointers allocated with cudaMalloc() to functions that execute on the host.

  • You cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host.

To free memory allocated with cudaMalloc(), use a call to cudaFree().

Device memory can be accessed in two ways: by using device pointers from within device code, and by using calls to cudaMemcpy() from host code.

The last parameter to cudaMemcpy() being cudaMemcpyDeviceToHost instructs the runtime that the source pointer is a device pointer and the destination pointer is a host pointer.

cudaMemcpyHostToDevice instructs the runtime that the source data is on the host and the destination is an address on the device.

Also, one can specify cudaMemcpyDeviceToDevice, which indicates that both pointers are on the device.
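A minimal sketch of the three transfer directions (the pointer names h_in, h_out, d_in, d_tmp, and d_out are illustrative):

  /* host to device : initialize device input */
  cudaMemcpy( d_in,  h_in,  bytes, cudaMemcpyHostToDevice );

  /* device to device : both pointers refer to device memory */
  cudaMemcpy( d_tmp, d_in,  bytes, cudaMemcpyDeviceToDevice );

  /* device to host : bring the result back */
  cudaMemcpy( h_out, d_out, bytes, cudaMemcpyDeviceToHost );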


Querying Devices

The CUDA runtime provides an easy interface for determining which devices (if any) are present and what capabilities each device supports. First, to get a count of how many devices in the system are built on the CUDA architecture, call cudaGetDeviceCount(). After calling cudaGetDeviceCount(), iterate through the devices and query the relevant information about each one. The CUDA runtime returns the device properties in a structure of type cudaDeviceProp. As of CUDA 4.1 and CUDA 5.0, the cudaDeviceProp structure contains the necessary information, and most of its fields are self-explanatory and commonly used CUDA device properties. The third example program, given below, is focused on device properties:


#include <stdio.h> 
#include <time.h> 
#include <cuda.h> 

#define KB 1024       /* To indicate results in KiloBytes */

/* Utility Macro : CUDA SAFE CALL */
void CUDA_SAFE_CALL( cudaError_t call)
{
     cudaError_t ret = call;
     switch(ret)
     {
          case cudaSuccess:
               break;
          default :
          {
               printf(" ERROR at line : %d : error %d : %s\n",
                      __LINE__, ret, cudaGetErrorString(ret));
               exit(-1);
               break;
          }
     }
}

int main ( void ) {
int count;
cudaDeviceProp prop;

CUDA_SAFE_CALL(cudaGetDeviceCount( &count) );

for(int i = 0; i < count; i++) {
CUDA_SAFE_CALL( cudaGetDeviceProperties( &prop, i) );

printf("Information about the device \t: %d\n", count);

printf("Name \t\t\t\t: %s\n",prop.name);

printf("Compute capability \t\t: %d.%d\n",
        prop.major, prop.minor);

printf("Clock rate \t\t\t: %d\n", prop.clockRate);

printf("Device overlap \t\t\t: ");


if (prop.deviceOverlap)
        printf("ENABLED \n");

else
        printf("DISABLED\n");

printf("Kernel execution timeout \t: ");


if (prop.kernelExecTimeoutEnabled)
        printf("ENABLED\n");

else
        printf("DISABLED\n");

printf("Total global memory \t\t: %ld MB\n",
       (prop.totalGlobalMem/KB)/KB);

printf("Total constant memory \t\t: %ld\n",
        prop.totalConstMem);

printf("Maximum memory pitch \t\t: %ld\n", prop.memPitch);

printf("Texture alignment \t\t: %ld\n",
        prop.textureAlignment);

printf("Multiprocessor count \t\t: %d\n",
        prop.multiProcessorCount);

printf("Shared memory per MP \t\t: %ld KB\n",
        prop.sharedMemPerBlock/KB);

printf("Registers per MP \t\t: %ld\n",
        prop.regsPerBlock);

printf("Threads in warp \t\t: %d\n", prop.warpSize);

printf("Maximum threads per dimension \t: %d\n",
        prop.maxThreadsPerBlock);

printf("Maximum thread dimension \t: (%d, %d, %d)\n",
        prop.maxThreadsDim[0], prop.maxThreadsDim[1],
        prop.maxThreadsDim[2]);

printf("\n\n\n");

}
return 0;
}





CUDA Device Structure

  struct cudaDeviceProp {
        char name[256];
        size_t totalGlobalMem;
        size_t sharedMemPerBlock;
        int regsPerBlock;
        int warpSize;
        size_t memPitch;
        int maxThreadsPerBlock;
        int maxThreadsDim[3];
        int maxGridSize[3];
        size_t totalConstMem;
        int major;
        int minor;
        int clockRate;
        size_t textureAlignment;
        int deviceOverlap;
        int multiProcessorCount;
        int kernelExecTimeoutEnabled;
        int integrated;
        int canMapHostMemory;
        int computeMode;
        int maxTexture1D;
        int maxTexture2D[2];
        int maxTexture3D[3];
        int maxTexture2DArray[3];
        int concurrentKernels;
  };



CUDA Device Properties (Refer NVIDIA CUDA Programming Guide)

Device Property Description
char name[256] An ASCII string identifying the device (e.g., "GeForce GTX 280")
size_t totalGlobalMem The amount of global memory on the device in bytes
size_t sharedMemPerBlock The maximum amount of shared memory a single block may use in bytes
int regsPerBlock The number of 32-bit registers available per block
int warpSize The number of threads in a warp
size_t memPitch The maximum pitch allowed for memory copies in bytes
int maxThreadsPerBlock The maximum number of threads that a block may contain
int maxThreadsDim[3] The maximum number of threads allowed along each dimension of a block
int maxGridSize[3] The number of blocks allowed along each dimension of a grid
size_t totalConstMem The amount of available constant memory
int major The major revision of the device's compute capability
int minor The minor revision of the device's compute capability
size_t textureAlignment The device's requirement for texture alignment
int deviceOverlap A boolean value representing whether the device can simultaneously perform a cudaMemcpy() and kernel execution
int multiProcessorCount The number of multiprocessors on the device
int kernelExecTimeoutEnabled A boolean value representing whether there is a runtime limit for kernels executed on this device
int integrated A boolean value representing whether the device is an integrated GPU (i.e., part of the chipset and not a discrete GPU)
int canMapHostMemory A boolean value representing whether the device can map host memory into the CUDA device address space
int computeMode A value representing the device's computing mode: default, exclusive, or prohibited
int maxTexture1D The maximum size supported for 1D textures
int maxTexture2D[2] The maximum dimensions supported for 2D textures
int maxTexture3D[3] The maximum dimensions supported for 3D textures
int maxTexture2DArray[3] The maximum dimensions supported for 2D texture arrays
int concurrentKernels A boolean value representing whether the device supports executing multiple kernels within the same context simultaneously


Using Device Properties : Program


A query with cudaGetDeviceProperties() is useful for applications in which a kernel needs close interaction with the CPU, and for applications that may be executed on an integrated GPU that shares system memory with the CPU.

For example, if an application depends on having double-precision floating-point support, it needs to check that the card has compute capability 1.3 or higher, since only such devices support double-precision floating-point calculations. To run the application, we need to find at least one device of compute capability 1.3 or higher.

First, we fill a cudaDeviceProp structure with the properties the device must have and pass it to cudaChooseDevice() to have the CUDA runtime find a device that satisfies the given constraint. The call to cudaChooseDevice() returns a device ID that we can then pass to cudaSetDevice(). The description of the program is as follows:




How to find a Device

(Download source code : cuda-find-device )

#include <stdio.h> 
#include <string.h>   /* for memset() */
#include <time.h> 
#include <cuda.h> 

/* Utility Macro : CUDA SAFE CALL */
void CUDA_SAFE_CALL( cudaError_t call)
{
     cudaError_t ret = call;
     switch(ret)
     {
          case cudaSuccess:
               break;
          default :
          {
               printf(" ERROR at line : %d : error %d : %s\n",
                      __LINE__, ret, cudaGetErrorString(ret));
               exit(-1);
               break;
          }
     }
}

int main ( void ) {
     int count;
     int dev;
     cudaDeviceProp prop;

     CUDA_SAFE_CALL( cudaGetDeviceCount( &count) );

     /* Print the name of every CUDA device present in the system */
     for(int i = 0; i < count; i++) {
          CUDA_SAFE_CALL( cudaGetDeviceProperties( &prop, i) );

          printf("Information about the device \t: %d\n", i);
          printf("Name \t\t\t\t: %s\n", prop.name);
     }

     /* ID of the device currently in use */
     CUDA_SAFE_CALL( cudaGetDevice(&dev) );
     printf("ID of the current CUDA device \t: %d\n", dev);

     /* Ask the runtime for a device of compute capability 1.3 or higher */
     memset(&prop, 0, sizeof (cudaDeviceProp));
     prop.major = 1;
     prop.minor = 3;

     CUDA_SAFE_CALL( cudaChooseDevice(&dev, &prop ) );

     printf("ID of CUDA device closest to revision 1.3 : %d \n", dev);

     CUDA_SAFE_CALL( cudaSetDevice(dev) );

     return 0;
}







CUDA Program for Vector Vector Addition

The input vectors are generated on the host-CPU and transferred to the device-GPU for vector-vector addition. A simple kernel based on a grid of thread blocks is used, in which each thread is given a unique thread ID within its block. Each thread performs a partial addition of the two vectors, and the final result is generated on the device-GPU and transferred back to the host-CPU. The important steps are given below.

Steps Description
1. Memory allocation on host-CPU and device-GPU :
   Allocate memory for the two input vectors and the resultant vector on the host-CPU and device-GPU.
   Use cudaMalloc(void** array, int size)
2. Input data generation :
   Fill the input vectors with single/double precision real values using randomized data as per the input specification.
3. Transfer data from host-CPU to device-GPU :
   Transfer the host-CPU vectors to the device-GPU to perform the computation.
   Use cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
4. Launch kernel :
   Define the dimensions of the grid and block on the host-CPU and launch the kernel for execution on the device-GPU.
   The computation for vector-vector addition is performed on the device.
5. Transfer the result from device-GPU to host-CPU :
   Copy the resultant vector from the device-GPU to the host-CPU.
   Use cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
6. Check correctness of the result on host-CPU :
   Compute the vector-vector addition on the host-CPU and compare the CPU and GPU results.
7. Free the memory :
   Free the memory of the arrays allocated on the host-CPU and device-GPU.
   Use cudaFree(void* array)



(Download source code : cuda-vector-vector-addition-blocks )



Example Program : Vector Vector Addition

#include <stdio.h> 
#include <cuda.h> 

#define N 100


  __global__ void add ( int *a, int *b, int *c ) {

        int tid = blockIdx.x;   // handle the data at this block index

        // CUDA C allows you to define a group of blocks in two dimensions

        if ( tid < N)
              c[tid] = a[tid] + b[tid];
  }

  /* Utility Macro : CUDA SAFE CALL */
  void CUDA_SAFE_CALL( cudaError_t call)
  {
       cudaError_t ret = call;
       switch(ret)
       {
            case cudaSuccess:
                 break;
            default :
            {
                 printf(" ERROR at line : %d : error %d : %s\n",
                        __LINE__, ret, cudaGetErrorString(ret));
                 exit(-1);
                 break;
            }
       }
  }

  int main ( void ) {
int a[N],b[N], c[N];
int *dev_a, *dev_b, *dev_c;

// Allocate memory on the Device GPU
CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_a, N * sizeof (int) ) );

CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_b, N * sizeof (int) ) );

CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_c, N * sizeof (int) ) );

// Fill the Arrays 'a' and 'b' on the Host CPU
for(int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i+1;
}

// Copy the Arrays 'a' and 'b' to the Device GPU
CUDA_SAFE_CALL(cudaMemcpy(
                      dev_a,
                      a,
                      N * sizeof (int),
                      cudaMemcpyHostToDevice) );

CUDA_SAFE_CALL(cudaMemcpy(
                      dev_b,
                      b,
                      N * sizeof (int),
                      cudaMemcpyHostToDevice) );

add <<< N,1 >>>(dev_a, dev_b, dev_c);

// Copy the result Array 'c' back from the Device GPU to the Host CPU
CUDA_SAFE_CALL(cudaMemcpy(
                      c,
                      dev_c,
                      N * sizeof (int),
                      cudaMemcpyDeviceToHost) );

// Display the results on Host CPU
for(int i = 0; i < N; i++) {
          printf("%d + %d = %d \n", a[i], b[i], c[i] );
}

// Free the memory allocated on the Device GPU
cudaFree( dev_a);
cudaFree( dev_b);
cudaFree( dev_c);

return 0;
  }





Vector Vector Addition (Thread Cooperation-Splitting Blocks)

In the earlier example, a simple kernel add based on a grid of thread blocks is used, in which each thread is given a unique thread ID within its block. Each thread performs a partial addition of the two vectors, and the final result is generated on the device-GPU and transferred back to the host-CPU.


// N stands for number of Blocks
// 1 stands for number of threads per block


kernel <<< N,1 >>>();





The CUDA runtime allows these blocks to be split into threads. In the earlier example, N blocks are running on the GPU. To identify which block is running, the variable blockIdx.x is used in the kernel code.

  __global__ void add ( int *a, int *b, int *c ) {

        int tid = blockIdx.x;   // handle the data at this block index

        // CUDA C : Allows a group of blocks to be defined in two dimensions

        if ( tid < N)
              c[tid] = a[tid] + b[tid];
  }





Here, the variable blockIdx.x is a built-in variable defined by the CUDA runtime. CUDA C allows a group of blocks to be defined in two dimensions, and the CUDA runtime allows these blocks to be split into threads.

In the above, the first argument in the angle brackets, N, represents the number of blocks to be launched, and the second parameter represents the number of threads per block. The CUDA runtime therefore creates “N parallel threads” in the above example, as given below.

  N blocks X 1 thread/block = N Parallel threads
In the above example, a launch of N blocks of one thread each is performed at run time.


kernel <<< 1,N >>>();

In the above, a launch of N threads, all within one block, is performed.

In the earlier example program, the input and output data are indexed by the block index, i.e.,

int tid = blockIdx.x;


With a single block of many threads, the data is instead indexed by the thread index:

int tid = threadIdx.x;

With the above, we can rewrite the code to move from a parallel block implementation to a parallel thread implementation. The source code listing is given in the following few lines.

  __global__ void add ( int *a, int *b, int *c ) {

        int tid = threadIdx.x;   // handle the data at this thread index

        // CUDA C : Allows a group of blocks to be defined in two dimensions

        if ( tid < N)
              c[tid] = a[tid] + b[tid];
  }

The maximum number of blocks in a single launch is 65,535, and the hardware also limits the number of threads per block with which we can launch a kernel. The maxThreadsPerBlock field of the device properties structure specifies this maximum, and a launch cannot exceed it. For many GPUs, the limit is 512 threads per block.

To incorporate multiple blocks and threads, the indexing will start to look similar to the standard method for converting from a two-dimensional index space to a linear space.

We use the new built-in variable blockDim. This variable is constant for all blocks and stores the number of threads along each dimension of the block. Since the present example uses a one-dimensional block, we refer only to blockDim.x. The number of blocks along each dimension of the entire grid is stored in gridDim. It is important to note that gridDim is two-dimensional, whereas blockDim is actually three-dimensional.

That is, the CUDA runtime allows you to launch a two-dimensional grid of blocks, where each block is a three-dimensional array of threads.



  __global__ void add ( int *a, int *b, int *c ) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < N)
              c[tid] = a[tid] + b[tid];
  }
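On the host side, a common launch idiom (a sketch; the value of threadsPerBlock is illustrative and must not exceed maxThreadsPerBlock) pairs this kernel with just enough blocks to cover all N elements:

  int threadsPerBlock = 128;
  int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock;   /* ceiling division */

  add <<< blocksPerGrid, threadsPerBlock >>>( dev_a, dev_b, dev_c );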





Vector Vector Addition (Dimension of Grid & Each Block)

In CUDA, all the threads in a grid execute the same kernel function. Also, each thread has unique coordinates to distinguish itself from the others and to identify the appropriate portion of the data to access. These threads are organized into a two-level hierarchy using unique coordinates, blockIdx (the block index) and threadIdx (the thread index), assigned to them by the CUDA runtime system; these variables can be accessed within kernel functions. When a thread executes the kernel function, references to the blockIdx and threadIdx variables return the coordinates of the thread. Additional built-in variables, gridDim and blockDim, provide the dimensions of the grid and of each block, respectively.

In the CUDA thread organization, a grid consists of N thread blocks, and each block, in turn, consists of M threads; each grid therefore has a total of N*M threads. All blocks at the grid level are organized as one- or two-dimensional (1D or 2D) arrays; all threads within each block are organized as one-, two-, or three-dimensional (1D, 2D, or 3D) arrays.

In general, a grid is organized as a 2D array of blocks, and each block is organized as a 3D array of threads. The exact organization of a grid is determined by the execution configuration provided at kernel launch. The first parameter of the execution configuration specifies the dimensions of the grid in terms of the number of blocks; the second specifies the dimensions of each block in terms of the number of threads. Each such parameter is of type dim3, which is essentially a C struct with three unsigned integer fields: x, y, and z. Because grids are 2D arrays of blocks, the third field of the grid dimension parameter is ignored; it should be set to 1 for clarity. The following host code can be used to launch the kernel, and the details are explained below.


dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
KernelFunction<<< dimGrid, dimBlock >>>(...);

The first two statements initialize the execution configuration parameters. Because the grid and the blocks are 1D arrays in this case, only the first dimension of dimBlock and dimGrid is used; the other dimensions are set to 1. The third statement is the actual kernel launch, with the execution configuration parameters placed between <<< and >>>.

The values of gridDim.x and gridDim.y can be calculated from other variables at kernel launch time. Once a kernel is launched, its dimensions cannot change. All threads in a block share the same blockIdx value. The blockIdx.x value ranges between 0 and gridDim.x-1, and the blockIdx.y value between 0 and gridDim.y-1.

In general, blocks are organized into 3D arrays of threads, and all blocks in a grid have the same dimensions. Each threadIdx consists of three components: the x coordinate threadIdx.x, the y coordinate threadIdx.y, and the z coordinate threadIdx.z. The number of threads in each dimension of a block is specified by the second execution configuration parameter given at the kernel launch. Within the kernel, this configuration parameter can be accessed as the predefined struct variable blockDim. The total size of a block is limited to 512 threads, with flexibility in distributing these elements over the three dimensions as long as the total number of threads does not exceed 512. For example, (512, 1, 1), (8, 16, 2), and (16, 16, 2) are all allowable blockDim values.
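As a minimal sketch of such a 2D configuration (the kernel name process and the names d_data, width, and height are illustrative), a kernel that works on a width x height array might compute its coordinates and be launched as follows:

  __global__ void process( float *data, int width, int height )
  {
        /* 2D coordinates of this thread within the whole grid */
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;

        if (x < width && y < height)
              data[y * width + x] *= 2.0f;    /* linearized 2D index */
  }

  /* host side : 16 x 16 = 256 threads per block, enough blocks to cover the array */
  dim3 dimBlock( 16, 16, 1 );
  dim3 dimGrid( (width  + dimBlock.x - 1) / dimBlock.x,
                (height + dimBlock.y - 1) / dimBlock.y, 1 );
  process<<< dimGrid, dimBlock >>>( d_data, width, height );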


#include <stdio.h> 
#include <stdlib.h>   /* for malloc(), drand48(), exit() */
#include <cuda.h> 

#define EPS 1.0e-12
#define GRIDSIZE 10
#define BLOCKSIZE 16

#define SIZE 128



double *dMatA, *dMatB, *dresult;

double *hMatA, *hMatB, *hMatC, *hresult;

double *CPU_Result;

int vlength = SIZE, count = 0;

int blockWidth;

cudaEvent_t start, stop;
cudaDeviceProp deviceProp;
int device_Count, size = SIZE;

  __global__ void vectvectadd( double *dm1, double *dm2, double *dres, int num ) {

        int tx = blockIdx.x*blockDim.x + threadIdx.x;
        int ty = blockIdx.y*blockDim.y + threadIdx.y;
        int tindex = tx + (gridDim.x)*(blockDim.x)*ty;

        if(tindex < num)
              dres[tindex] = dm1[tindex] + dm2[tindex];

  }

  /* Check for safe return of all calls to the device */
  /* Utility Macro : CUDA SAFE CALL */
  void CUDA_SAFE_CALL( cudaError_t call)
  {
       cudaError_t ret = call;
       switch(ret)
       {
            case cudaSuccess:
                 break;
            default :
            {
                 printf(" ERROR at line : %d : error %d : %s\n",
                        __LINE__, ret, cudaGetErrorString(ret));
                 exit(-1);
                 break;
            }
       }
  }

  /* Fill the vector with double precision values */
  void fill_dp_vector( double* vec, int size )
  {
        /* fill the vector on the host-CPU with random values */
        for(int i = 0; i < size; i++)
              vec[i] = drand48();
  }

  /* Terminate and exit if the requested block or grid dimensions
      exceed the limits of the device */

  void check_block_grid_dim(
                                    cudaDeviceProp devProp,
                                    dim3 blockDim,
                                    dim3 gridDim)
  {
        if (blockDim.x >= devProp.maxThreadsDim[0] ||
            blockDim.y >= devProp.maxThreadsDim[1] ||
            blockDim.z >= devProp.maxThreadsDim[2] )
        {
              exit(-1);
        }

        if (gridDim.x >= devProp.maxGridSize[0] ||
            gridDim.y >= devProp.maxGridSize[1] ||
            gridDim.z >= devProp.maxGridSize[2] )
        {
              exit(-1);
        }
  }


  /* Memory Allocation Errors */
  void mem_error( char *arrayname, char *benchmark, int len, char *type )
  {
        printf("\n Memory not sufficient to allocate for array %s\n\t"
               " Benchmark : %s \n\t"
               " Memory requested = %d number of %s elements\n",
               arrayname, benchmark, len, type);

        printf("\n\t Aborting \n\n");
        exit(-1);
  }

  /* Number of CUDA devices present in the system */
  int get_DeviceCount()
  {
        int count;
        cudaGetDeviceCount(&count);
        return count;
  }

  /* Device Query Information */
  void deviceQuery()
  {
        int device;
        device_Count = get_DeviceCount();

        cudaSetDevice(0);
        cudaGetDevice(&device);
        cudaGetDeviceProperties(&deviceProp, device);
  }

 /* Launch Kernel */
  void launch_kernel()
  {
        dim3 dimBlock(BLOCKSIZE, BLOCKSIZE);

        dim3 dimGrid((vlength/(BLOCKSIZE*BLOCKSIZE))+1, 1);

        check_block_grid_dim (deviceProp, dimBlock, dimGrid);

        vectvectadd <<< dimGrid, dimBlock >>>
        (dMatA, dMatB, dresult, vlength);
  }

 /* Device Memory free */
  void dfree (double* arr[ ], int len)
  {
        /* Free memory on the Device GPU */
        for(int i = 0; i < len; i++)
              CUDA_SAFE_CALL( cudaFree(arr[i]) );
        printf("memory freed \n");
  }

  /* Vector Vector Addition on the Host CPU */
  void vectvect_add_in_cpu(double *A, double *B, double *C, int size)
  {
        for(int i = 0; i < size; i++)
              C[i] = A[i] + B[i];
  }

  /* print_Gflops_rating : returns the Gflops rating of the kernel */
  double print_Gflops_rating(float Tsec, int size)
  {
        double gflops;
        gflops = (1.0e-9 * ( (1.0 * size ) / Tsec) );
        return gflops;
  }

  /* print_on_screen */
  void print_on_screen(char *program_name, float tsec, double gflops,
          int size, int flag)    // flag = 1 if Gflops calculation else flag = 0
  {
        printf("\n ........ %s ........ \n", program_name);
        printf("\t SIZE \t TIME_SEC \t Gflops\n");
        if(flag == 1)
              printf("\t%d \t%f \t%lf \t", size, tsec, gflops);
        else
              printf("\t%d \t%s \t%s \t", size, "---", "---");
  }

  int main ( void ) {
        double *array[3];
        float elapsedTime, Tsec;

        deviceQuery();

        // CUDA Events : Time Calculation
        CUDA_SAFE_CALL( cudaEventCreate(&start));
        CUDA_SAFE_CALL( cudaEventCreate(&stop));

        // Allocate host memory
        hMatA = (double*) malloc( vlength * sizeof(double));
        if(hMatA == NULL)
              mem_error("hMatA","vectvectadd",vlength,"double");

        hMatB = (double*) malloc( vlength * sizeof(double));
        if(hMatB == NULL)
              mem_error("hMatB","vectvectadd",vlength,"double");

        hresult = (double*) malloc( vlength * sizeof(double));
        if(hresult == NULL)
              mem_error("hresult","vectvectadd",vlength,"double");

        // Allocate device memory
        CUDA_SAFE_CALL( cudaMalloc( (void**)&dMatA, vlength * sizeof (double) ) );
        CUDA_SAFE_CALL( cudaMalloc( (void**)&dMatB, vlength * sizeof (double) ) );
        CUDA_SAFE_CALL( cudaMalloc( (void**)&dresult, vlength * sizeof (double) ) );

        array[0] = dMatA;
        array[1] = dMatB;
        array[2] = dresult;

        // Fill the data in the host vectors
        fill_dp_vector(hMatA, vlength);
        fill_dp_vector(hMatB, vlength);

        // Copy the input vectors from the Host CPU to the Device GPU
        CUDA_SAFE_CALL(cudaMemcpy(
                      (void*)dMatA,
                      (void*)hMatA,
                      vlength * sizeof (double),
                      cudaMemcpyHostToDevice) );

        CUDA_SAFE_CALL(cudaMemcpy(
                      (void*)dMatB,
                      (void*)hMatB,
                      vlength * sizeof (double),
                      cudaMemcpyHostToDevice) );

        // CUDA Events : record start time, launch the kernel, record stop time
        CUDA_SAFE_CALL( cudaEventRecord (start, 0));

        launch_kernel();

        CUDA_SAFE_CALL( cudaEventRecord (stop, 0));
        CUDA_SAFE_CALL( cudaEventSynchronize (stop));
        CUDA_SAFE_CALL( cudaEventElapsedTime (&elapsedTime, start, stop));
        Tsec = elapsedTime / 1000.0f;      // milliseconds to seconds

        // Copy the result vector from the Device GPU to the Host CPU
        CUDA_SAFE_CALL(cudaMemcpy(
                      (void*)hresult,
                      (void*)dresult,
                      vlength * sizeof (double),
                      cudaMemcpyDeviceToHost) );

        printf("\n --------------------------------------------------");

        // Calculation of Gflops & printing
        print_on_screen("Vector Vector Addition", Tsec,
                        print_Gflops_rating(Tsec, vlength), size, 1);

        // Free the memory allocated on the Device GPU
        dfree(array, 3);

        // Free the memory allocated on the Host CPU
        free(hMatA);
        free(hMatB);
        free(hresult);

        return 0;
  }




(Download source code : cuda_memcheck_nvml.c )

CUDA Compilation, Linking and Execution of Program

For compilation of a CUDA program, additional steps are involved, partly because the program targets two different processor architectures (the GPU and a host CPU), and partly because of CUDA's hardware abstraction. Compiling a CUDA program is not as straightforward as running a C compiler to convert source code into executable object code. The same source file mixes C/C++ code written for both the GPU and the CPU, and special extensions and declarations identify the GPU code. The first step is to separate the source code for each target architecture.

nvcc is a compiler driver that simplifies the process of compiling CUDA code: it provides simple and familiar command-line options and executes them by invoking the collection of tools that implement the different compilation stages. nvcc's basic workflow consists of separating device code from host code and compiling the device code into a binary form, or cubin object. The generated host code is output either as C code that is left to be compiled using another tool, or directly as object code by invoking the host compiler during the last compilation stage.



Figure 1. CUDA : Source Code Compilation Stages.



CUDA code should include the cuda.h header file. On the compilation command line, the CUDA library should be specified to the linker in UNIX and Linux environments. The two steps are explained below.

Using command line arguments to compile CUDA source code:

Compiling and executing a CUDA program is as simple as compiling C language source code.

$ nvcc -o < executable name > < name of source file >

For example, to compile the simple Hello World program, the user can give :

$ nvcc -o helloworld cuda-helloworld.cu
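When the host code lives in separate .c or .cpp files, a two-step compile-and-link sketch along the following lines can be used (the file names are illustrative, and the CUDA library path may differ on your installation):

$ nvcc -c cuda-helloworld.cu -o helloworld_kernel.o
$ gcc -c main.c -o main.o
$ g++ helloworld_kernel.o main.o -o helloworld -L/usr/local/cuda/lib64 -lcudart

Here nvcc compiles only the device-side file into an object, and the host compiler performs the final link against the CUDA runtime library (libcudart), as mentioned above.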

Executing a Program:

To execute a CUDA program, give the name of the executable at the command prompt.

$ . / < Name of the Executable >

For example, to execute the simple Hello World program, the user must type:

$ ./helloworld

The output should look similar to the following:

Hello World!


Centre for Development of Advanced Computing