



hyPACK-2013 Mode-2 : GPU Comp. CUDA enabled NVIDIA GPU Prog.
CUDA Asynchronous Concurrent Execution & Streams


NVIDIA's CUDA programming model manages threads automatically and differs significantly from single-threaded CPU code and, to some extent, even from parallel CPU code. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Modern GPUs can perform massively data-parallel computations and, alongside multi-threaded CPU applications, task-parallel work as well. In data parallelism, the same function is computed over many data elements; in task parallelism, two or more completely different tasks are executed in parallel.

On present GPUs, the use of task-parallel application kernels is growing, and state-of-the-art GPUs give programmers an opportunity to extract even more speed from GPU-based implementations. CUDA streams provide several ways to execute certain operations simultaneously on a single GPU or on multiple GPUs. Important topics on the CUDA C runtime and "Streams - Asynchronous Concurrent Execution" are discussed in detail below with example programs.

In all the programs, CUDA_SAFE_CALL(), which surrounds the CUDA API calls, is a utility macro provided as part of the hands-on codes. It simply detects that a call has returned an error, prints the associated error message, and exits the application with a failure code.



CUDA Runtime functions : Different types of Memory

Initialization : The CUDA runtime is initialized the first time a runtime function is called; during initialization, the runtime creates a CUDA context for each device in the system. This context is the primary context for that device and is shared among all the host threads of the application. The runtime does not expose the primary context to the application. When a host thread calls cudaDeviceReset(), this destroys the primary context of the device the host thread currently operates on. A host thread can set the device it operates on at any time by calling cudaSetDevice().
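As a minimal sketch of this sequence (the device index 0 and the printed message are chosen purely for illustration), a host thread might select a device, do its work, and then tear the primary context down:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main( void )
    {
        int deviceCount = 0;
        cudaGetDeviceCount( &deviceCount );   /* number of CUDA-capable devices */
        printf( "Found %d CUDA device(s)\n", deviceCount );

        cudaSetDevice( 0 );                   /* this host thread now operates on device 0;
                                                 its primary context is created on first use */

        /* ... allocate device memory, launch kernels, copy results back ... */

        cudaDeviceReset();                    /* destroy the primary context of the device
                                                 this host thread operates on */
        return 0;
    }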

Device Memory : As discussed earlier, the CUDA programming model assumes a system composed of a host and a device, each with their own separate memory. Kernels can only operate out of device memory, so the runtime provides functions to allocate, deallocate, and copy device memory, as well as transfer data between host memory and device memory. CUDA threads may access data from multiple memory spaces during their execution. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages. Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats. The global, constant, and texture memory spaces are persistent across kernel launches by the same application.

Device memory can be allocated either as linear memory or as CUDA arrays. CUDA arrays are opaque memory layouts optimized for texture fetching. Linear memory is typically allocated using cudaMalloc() and freed using cudaFree(), and data transfers between host memory and device memory are typically done using cudaMemcpy(). Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(); these functions pad the allocation appropriately to meet alignment requirements, which gives the best performance when accessing row addresses or when performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions). The runtime API also provides ways of accessing variables declared in global memory space: cudaGetSymbolAddress() retrieves the address pointing to the memory allocated for such a variable, and cudaGetSymbolSize() obtains the size of the allocated memory. A short code sample follows.
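The sketch below illustrates these calls; the variable names (constData, devData, devPointer) and the sizes are only illustrative:

    /* Accessing variables declared in global/constant memory space via the runtime API */
    __constant__ float constData[256];
    __device__   float devData;
    __device__   float *devPointer;

    void symbol_access_example( void )
    {
        float  data[256] = { 0 };
        float  value     = 3.14f;
        float *ptr       = NULL;

        /* copy to / from a __constant__ array */
        cudaMemcpyToSymbol( constData, data, sizeof( data ) );
        cudaMemcpyFromSymbol( data, constData, sizeof( data ) );

        /* copy a scalar into a __device__ variable */
        cudaMemcpyToSymbol( devData, &value, sizeof( float ) );

        /* store a device pointer in a __device__ variable */
        cudaMalloc( (void**)&ptr, 256 * sizeof( float ) );
        cudaMemcpyToSymbol( devPointer, &ptr, sizeof( ptr ) );

        /* query the address and size of the memory allocated for a symbol */
        void  *symAddr = NULL;
        size_t symSize = 0;
        cudaGetSymbolAddress( &symAddr, constData );
        cudaGetSymbolSize( &symSize, constData );
    }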

Shared memory : Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight.

Shared memory is allocated using the __shared__ qualifier. Shared memory is expected to be much faster than global memory. Any opportunity to replace global memory accesses with shared memory accesses should therefore be exploited, as illustrated by the matrix multiplication example in this workshop.

The code samples in the hands-on session include an implementation of matrix multiplication that takes advantage of shared memory.

  • In this implementation, each thread block is responsible for computing one square sub-matrix of the product, and each thread within the block is responsible for computing one element of that sub-matrix.

  • In order to fit into the device's resources, the two rectangular input matrices are divided into as many square sub-matrices of dimension block_size as necessary, and each output block is computed as the sum of the products of these square sub-matrices.

  • Each of these products is performed by first loading the two corresponding square matrices from global memory to shared memory with one thread loading one element of each matrix, and then by having each thread compute one element of the product.

  • Each thread accumulates the result of each of these products into a register and, once done, writes the result to global memory. By blocking the computation this way, we take advantage of fast shared memory and save a lot of global memory bandwidth, since each sub-matrix of A and B is read from global memory only once per output block that uses it rather than once per output element. __device__ functions are used to get and set elements and to build any sub-matrix from a matrix (a kernel sketch is given below).
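A minimal sketch of such a kernel is given below, assuming square N x N matrices with N a multiple of BLOCK_SIZE and a launch configuration of dim3(BLOCK_SIZE, BLOCK_SIZE) threads per block and (N/BLOCK_SIZE, N/BLOCK_SIZE) blocks; the names are illustrative and boundary handling is omitted:

    #define BLOCK_SIZE 16

    /* C = A * B for square N x N matrices, N assumed to be a multiple of BLOCK_SIZE */
    __global__ void matMulShared( const float *A, const float *B, float *C, int N )
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;   /* row of C this thread computes    */
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;   /* column of C this thread computes */
        float sum = 0.0f;                                  /* accumulated in a register        */

        for (int m = 0; m < N / BLOCK_SIZE; ++m) {
            /* each thread loads one element of the A and B sub-matrices */
            As[threadIdx.y][threadIdx.x] = A[row * N + m * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(m * BLOCK_SIZE + threadIdx.y) * N + col];
            __syncthreads();                               /* wait until both tiles are loaded */

            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                               /* wait before overwriting the tiles */
        }
        C[row * N + col] = sum;                            /* single write to global memory    */
    }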



Page-Locked Host Memory :

On an NVIDIA GPU, device memory is allocated with cudaMalloc(). The CUDA runtime also offers its own mechanism for allocating host memory, cudaHostAlloc(). On the host, the C library routine malloc() can of course be used to allocate memory as well.

It is important to note that there is a significant difference between the memory that malloc() allocates and the memory that cudaHostAlloc() allocates. The C library function malloc() allocates standard, pageable host memory, while cudaHostAlloc() allocates a buffer of page-locked host memory, sometimes called pinned memory. Page-locked buffers have an important property: the operating system guarantees their residency in physical memory and will never page them out to disk. Because the buffer can be neither evicted nor relocated, the OS can safely give an application access to the physical address of this memory.

The CUDA runtime provides functions to allow the use of page-locked (also known as pinned) host memory (as opposed to regular pageable host memory allocated by malloc()): cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory; cudaHostRegister() page-locks a range of memory allocated by malloc().
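As a minimal sketch of the second route (assuming CUDA 4.0 or later, where cudaHostRegister() is available; the buffer size is illustrative), an existing malloc() buffer can be page-locked and later released as follows:

    #include <stdlib.h>
    #include <cuda_runtime.h>

    void pin_existing_buffer( size_t nbytes )
    {
        /* allocate ordinary pageable memory with malloc() ... */
        void *buf = malloc( nbytes );

        /* ... then page-lock (pin) that range so the GPU can DMA to/from it directly */
        cudaHostRegister( buf, nbytes, cudaHostRegisterDefault );

        /* ... use buf as a source/destination of cudaMemcpy()/cudaMemcpyAsync() ... */

        cudaHostUnregister( buf );   /* undo the page-locking */
        free( buf );                 /* release the pageable allocation */
    }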

Knowing the physical address of a buffer, the GPU can use direct memory access (DMA) to copy data to or from the host. DMA copies proceed without intervention from the CPU; with pageable memory, this would also mean that the CPU could be simultaneously paging those buffers out to disk or relocating their physical addresses by updating the OS's page tables. When performing a memory copy with pageable memory, the CUDA driver therefore still uses DMA, but it must first stage the data through an internal page-locked buffer.

It is important to observe that the speed of memory copies from pageable memory is bounded by the lower of the PCI Express transfer speed and the speed of the system's front-side bus. The benchmarks below focus on cudaMemcpy() performance with both pageable and page-locked memory. Note that pageable buffers would still incur the overhead of an additional CPU-managed copy even if the PCI Express and front-side-bus speeds were identical. Page-locked buffers should be freed as soon as they are no longer needed as a source or destination in calls to cudaMemcpy(), rather than waiting until the application exits. Using page-locked host memory has several benefits:

  • Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices.

  • On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory

  • On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked and even higher if in addition it is allocated as write-combining.

  • Page-locked host memory is a scarce resource, however, so allocations in page-locked memory will start failing long before allocations in pageable memory. In addition, by reducing the amount of physical memory available to the operating system for paging, consuming too much page-locked memory reduces overall system performance.

Portable Memory :

A block of page-locked memory can be used in conjunction with any device in the system but by default, the benefits of using page-locked memory described above are only available in conjunction with the device that was current when the block was allocated (and with all devices sharing the same unified address space, if any). To make these advantages available to all devices, the block needs to be allocated by passing the flag cudaHostAllocPortable to cudaHostAlloc() or page-locked by passing the flag cudaHostRegisterPortable to cudaHostRegister().

The CUDA runtime supports write-combining memory, which frees up the host's L1 and L2 cache resources, making more cache available to the rest of the application. In addition, write-combining memory is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%. The CUDA runtime also supports mapped memory, which allows host memory to be accessed directly from within a kernel. This has several advantages: data transfers are implicitly performed as needed by the kernel, and data transfers can be overlapped with kernel execution without using streams. Since mapped page-locked memory is shared between host and device, however, the application must synchronize memory accesses using streams or events to avoid any potential read-after-write, write-after-read, or write-after-write hazards.
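The sketch below shows the mapped (zero-copy) case, assuming the device reports canMapHostMemory; the kernel scale() and the array size are illustrative:

    #include <cuda_runtime.h>

    __global__ void scale( float *data, int n )
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            data[i] *= 2.0f;                      /* kernel reads/writes host memory directly */
    }

    void mapped_memory_example( int n )
    {
        float *host_ptr = NULL, *dev_ptr = NULL;

        cudaSetDeviceFlags( cudaDeviceMapHost );  /* must be set before the context is created */
        cudaHostAlloc( (void**)&host_ptr, n * sizeof( float ), cudaHostAllocMapped );
        cudaHostGetDevicePointer( (void**)&dev_ptr, host_ptr, 0 );

        scale<<< (n + 255) / 256, 256 >>>( dev_ptr, n );   /* no explicit cudaMemcpy() needed */
        cudaDeviceSynchronize();                  /* ensure the kernel has finished before the
                                                     host reads host_ptr */
        cudaFreeHost( host_ptr );
    }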

Using Pageable Host Memory : Example Program - cuda_malloc_test

The CUDA parallel programs below use "pageable host" and "page-locked host (pinned)" memory: each allocates a GPU buffer and a host buffer of matching sizes and then executes some number of copies between these two buffers. We perform "host to device" and "device to host" operations and use CUDA events to mark the start and stop of the sequence of copies. The copy operation is performed 100 times so that accurate timings can be calculated. The program calls the function cuda_malloc_test(), shown below. The description of the program is as follows:


    (Download source code : cuda-pageble-memory.cu )
Using Pageable Host Memory : Example Program - cuda_malloc_test

#include <stdio.h> 
#include <cudaSafeCall.h> 


#define SIZE (64*1024*1024)

    float cuda_malloc_test( int size, bool up ) {

      cudaEvent_t     start, stop;
      int             *a, *dev_a;
      float           elapsedTime;

      CUDA_SAFE_CALL( cudaEventCreate( &start ) );
      CUDA_SAFE_CALL( cudaEventCreate( &stop ) );

      a = (int*)malloc( size * sizeof( *a ) );
      CUDA_HANDLE_NULL( a );
      CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) ) );

      CUDA_SAFE_CALL( cudaEventRecord( start, 0 ) );
      for (int i = 0; i < 100; i++) {        /* the copy is repeated 100 times (see text above) */
          if (up)
              CUDA_SAFE_CALL( cudaMemcpy( dev_a, a, size * sizeof( *dev_a ),
                                          cudaMemcpyHostToDevice ) );
          else
              CUDA_SAFE_CALL( cudaMemcpy( a, dev_a, size * sizeof( *dev_a ),
                                          cudaMemcpyDeviceToHost ) );
      }
      CUDA_SAFE_CALL( cudaEventRecord( stop, 0 ) );
      CUDA_SAFE_CALL( cudaEventSynchronize( stop ) );
      CUDA_SAFE_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

      free( a );
      CUDA_SAFE_CALL( cudaFree( dev_a ) );
      CUDA_SAFE_CALL( cudaEventDestroy( start ) );
      CUDA_SAFE_CALL( cudaEventDestroy( stop ) );

      return elapsedTime;
    }



Using Page-locked Host Memory (Pinned Memory) : Allocating a Host and GPU Buffer & Performing Copies / CUDA Events - Timer

Another example code is given below that uses a page-locked buffer allocated with cudaHostAlloc().


    (Download source code : cuda-pinned-memory.cu )
Using Page-locked Memory : Example Program - cuda_malloc_test

#include <stdio.h> 
#include <cudaSafeCall.h> 


#define SIZE (64*1024*1024)

    float cuda_malloc_test( int size, bool up ) {

      cudaEvent_t     start, stop;
      int             *a, *dev_a;
      float           elapsedTime;

      CUDA_SAFE_CALL( cudaEventCreate( &start ) );
      CUDA_SAFE_CALL( cudaEventCreate( &stop ) );

      CUDA_SAFE_CALL( cudaHostAlloc( (void**)&a, size * sizeof( *a ),
                                     cudaHostAllocDefault ) );
      CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) ) );

      CUDA_SAFE_CALL( cudaEventRecord( start, 0 ) );
      for (int i = 0; i < 100; i++) {        /* the copy is repeated 100 times (see text above) */
          if (up)
              CUDA_SAFE_CALL( cudaMemcpy( dev_a, a, size * sizeof( *dev_a ),
                                          cudaMemcpyHostToDevice ) );
          else
              CUDA_SAFE_CALL( cudaMemcpy( a, dev_a, size * sizeof( *dev_a ),
                                          cudaMemcpyDeviceToHost ) );
      }
      CUDA_SAFE_CALL( cudaEventRecord( stop, 0 ) );
      CUDA_SAFE_CALL( cudaEventSynchronize( stop ) );
      CUDA_SAFE_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

      CUDA_SAFE_CALL( cudaFreeHost( a ) );
      CUDA_SAFE_CALL( cudaFree( dev_a ) );
      CUDA_SAFE_CALL( cudaEventDestroy( start ) );
      CUDA_SAFE_CALL( cudaEventDestroy( stop ) );

      return elapsedTime;
    }




The second CUDA parallel program focuses on passing parameters to a kernel and allocating memory on a device. It performs the addition of two values by calling a kernel declared as __global__ void add().

Responsibility of the programmer : The programmer should be aware of the restrictions on the usage of device pointers, which are summarized as follows.

  • Pass pointers allocated with cudaMalloc() to functions that execute on the device

  • Use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device.

  • Pass pointers allocated with cudaMalloc() to functions that execute on the host (though the host cannot dereference them).

  • Cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host.

To free memory allocated with cudaMalloc(), call cudaFree().

Device memory can be accessed in two ways: by using device pointers from within device code, and by using calls to cudaMemcpy() from host code.

The last parameter to cudaMemcpy() specifies the direction of the copy: cudaMemcpyDeviceToHost instructs the runtime that the source pointer is a device pointer and the destination pointer is a host pointer.

cudaMemcpyHostToDevice indicates that the source data is on the host and the destination is an address on the device.

One can also specify cudaMemcpyDeviceToDevice, which indicates that both pointers are on the device.
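The following routine is a minimal sketch of the three directions (the buffer names and size are illustrative):

    /* Minimal sketch of the three copy directions */
    void copy_directions_example( void )
    {
        int  host_buf[256] = { 0 };
        int *dev_src = NULL, *dev_dst = NULL;

        cudaMalloc( (void**)&dev_src, sizeof( host_buf ) );
        cudaMalloc( (void**)&dev_dst, sizeof( host_buf ) );

        cudaMemcpy( dev_src,  host_buf, sizeof( host_buf ), cudaMemcpyHostToDevice );   /* host   -> device */
        cudaMemcpy( dev_dst,  dev_src,  sizeof( host_buf ), cudaMemcpyDeviceToDevice ); /* device -> device */
        cudaMemcpy( host_buf, dev_dst,  sizeof( host_buf ), cudaMemcpyDeviceToHost );   /* device -> host   */

        cudaFree( dev_src );
        cudaFree( dev_dst );
    }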


CUDA Asynchronous Concurrent Execution

Concurrent Execution between Host and Device :

The CUDA runtime supports asynchronous function calls, which facilitate concurrent execution between host and device. When such a function is called, control is returned to the host thread before the device has completed the requested task. These operations include:

  • Kernel launches;

  • Memory copies between two addresses within the same device's memory;

  • Memory copies from host to device of a memory block of 64 KB or less;

  • Memory copies performed by functions that are suffixed with Async;

  • Memory set function calls.

Programmers can globally disable asynchronous kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only.

Overlap of Data Transfer and Kernel Execution

Some devices of compute capability 1.1 and higher can perform copies between page-locked host memory and device memory concurrently with kernel execution. Applications may query this capability by checking the asyncEngineCount device property, which is greater than zero for devices that support it. For devices of compute capability 1.x, this capability is only supported for memory copies that do not involve CUDA arrays or 2D arrays allocated through cudaMallocPitch().

Concurrent Kernel Execution

Some devices of compute capability 2.x can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property, which is equal to 1 for devices that support it. The maximum number of kernel launches that a device can execute concurrently is sixteen. A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context. Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.

Concurrent Data Transfers

Some devices of compute capability 2.x can perform a copy from page-locked host memory to device memory concurrently with a copy from device memory to page-locked host memory. Applications may query this capability by checking the asyncEngineCount device property, which is equal to 2 for devices that support it.
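A short query routine (a sketch; the printed labels are illustrative) shows how these capabilities can be checked at run time:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main( void )
    {
        cudaDeviceProp prop;
        int dev = 0;

        cudaGetDevice( &dev );                      /* device the host thread currently operates on */
        cudaGetDeviceProperties( &prop, dev );

        printf( "deviceOverlap     : %d\n", prop.deviceOverlap );     /* copy/kernel overlap supported   */
        printf( "asyncEngineCount  : %d\n", prop.asyncEngineCount );  /* 2 = concurrent up/down copies   */
        printf( "concurrentKernels : %d\n", prop.concurrentKernels ); /* 1 = concurrent kernel execution */
        return 0;
    }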


CUDA Streams

Applications can manage concurrency through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. A CUDA stream represents a queue of GPU operations that are executed in a specific order: several operations can be added to a stream, and the order in which they are added is the order in which they will be executed. Each stream can be viewed as a task on the GPU, and there are opportunities for such tasks to execute in parallel. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g. inter-kernel communication is undefined). Some stream features are:

  • Creation and Destruction;

  • Default Stream;

  • Explicit Synchronization;

  • Implicit Synchronization;

  • Overlapping Behavior.

Using a Single Stream & Multiple CUDA Streams

Applications may employ a single CUDA stream or multiple CUDA streams. With a single stream, the computation is divided into chunks at the start of the application, and the memory copies and kernel launch for each chunk are queued asynchronously into the stream; they execute in order on the GPU while the host thread is free to continue.

With multiple streams, different streams perform CUDA operations as required by the application; for example, stream 1 can copy its input buffers to the GPU while stream 0 executes its kernel, and stream 1 can then execute its kernel while stream 0 copies its results back to the host. The performance of the application using a single stream and the description of the program are as follows:


    (Download source code - single CUDA Stream : single-cuda-stream.cu )
    (Download source code - Multiple CUDA Streams : multiple-cuda-streams.cu )

Example Program : Using Single CUDA Stream

#include <stdio.h> 
#include <stdlib.h> 
#include <time.h> 
#include <cuda.h> 

#define sizeOfArray (1024*1024)   /* number of array elements; value chosen for illustration */

/* Utility Macro : CUDA SAFE CALL */
#define CUDA_SAFE_CALL( call )                                          \
    do {                                                                \
        cudaError_t ret = (call);                                       \
        if (ret != cudaSuccess) {                                       \
            printf(" ERROR at line %d (error %d): %s\n",                \
                   __LINE__, (int)ret, cudaGetErrorString(ret));        \
            exit(-1);                                                   \
        }                                                               \
    } while (0)

  __global__ void vectvectadd( int *device_a, int *device_b, int *device_result ) {

      int tindex = threadIdx.x + blockIdx.x * blockDim.x;
      if (tindex < sizeOfArray)
          device_result[tindex] = device_a[tindex] + device_b[tindex];
  }

  /* main : single-stream vector addition, timed with CUDA events */
int main ( int argc, char **argv ) {
cudaDeviceProp prop;

int *host_a, *host_b, *host_result;
int *device_a, *device_b, *device_result;
int whichDevice;

CUDA_SAFE_CALL( cudaGetDevice( &whichDevice ) );
CUDA_SAFE_CALL( cudaGetDeviceProperties( &prop, whichDevice ) );

    if (!prop.deviceOverlap)
    {
    printf("Device cannot handle overlaps \n");
    return 0;
    }


      cudaEvent_t     start, stop;
      float               elapsedTime;

      CUDA_SAFE_CALL( cudaEventCreate( &start ) );
      CUDA_SAFE_CALL( cudaEventCreate( &stop ) );

      cudaStream_t stream;
      CUDA_SAFE_CALL(cudaStreamCreate(&stream));

     CUDA_SAFE_CALL( cudaMalloc( (void**)&device_a,
                     sizeOfArray * sizeof( *device_a ) ) );
     CUDA_SAFE_CALL( cudaMalloc( (void**)&device_b,
                     sizeOfArray * sizeof( *device_b ) ) );
     CUDA_SAFE_CALL( cudaMalloc( (void**)&device_result,
                     sizeOfArray * sizeof( *device_result ) ) );

     CUDA_SAFE_CALL( cudaHostAlloc( (void**)&host_a,
                     sizeOfArray * sizeof( *host_a ), cudaHostAllocDefault ) );
     CUDA_SAFE_CALL( cudaHostAlloc( (void**)&host_b,
                     sizeOfArray * sizeof( *host_b ), cudaHostAllocDefault ) );
     CUDA_SAFE_CALL( cudaHostAlloc( (void**)&host_result,
                     sizeOfArray * sizeof( *host_result ), cudaHostAllocDefault ) );

for(int index = 0; index < sizeOfArray; index++)
{
host_a[index] = rand()%10;
host_b[index] = rand()%10;
}

CUDA_SAFE_CALL( cudaEventRecord( start, 0 ) );

CUDA_SAFE_CALL( cudaMemcpyAsync( device_a, host_a,
            sizeOfArray * sizeof( int ),
            cudaMemcpyHostToDevice, stream ) );

CUDA_SAFE_CALL( cudaMemcpyAsync( device_b, host_b,
            sizeOfArray * sizeof( int ),
            cudaMemcpyHostToDevice, stream ) );

/* Kernel call : launched into the same stream */
vectvectadd<<< (sizeOfArray + 255) / 256, 256, 0, stream >>>( device_a,
            device_b, device_result );

/* Copy the result back : device to host */
CUDA_SAFE_CALL( cudaMemcpyAsync( host_result, device_result,
            sizeOfArray * sizeof( int ),
            cudaMemcpyDeviceToHost, stream ) );


CUDA_SAFE_CALL(cudaStreamSynchronize(stream));
CUDA_SAFE_CALL(cudaEventRecord(stop, 0));
CUDA_SAFE_CALL(cudaEventSynchronize(stop));
CUDA_SAFE_CALL(cudaEventElapsedTime(&elapsedTime, start, stop));

printf("*********** CDAC - Tech Workshop : hyPACK-2013 \n");
printf("\n Size of array : %d \n", sizeOfArray);
printf("\n Time taken: %3.1f ms \n", elapsedTime);

CUDA_SAFE_CALL(cudaFreeHost(host_a));
CUDA_SAFE_CALL(cudaFreeHost(host_b));
CUDA_SAFE_CALL(cudaFreeHost(host_result));
CUDA_SAFE_CALL(cudaFree(device_a));
CUDA_SAFE_CALL(cudaFree(device_b));
CUDA_SAFE_CALL(cudaFree(device_result));
CUDA_SAFE_CALL(cudaEventDestroy(start));
CUDA_SAFE_CALL(cudaEventDestroy(stop));
CUDA_SAFE_CALL(cudaStreamDestroy(stream));

return 0;

}
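The multiple-streams program (multiple-cuda-streams.cu) is not reproduced here; the routine below is a minimal sketch of the chunked, two-stream pattern it follows. It assumes the vectvectadd kernel and CUDA_SAFE_CALL macro defined above, pinned host buffers of FULL_SIZE ints, and per-stream device buffers of CHUNK ints each; the names, FULL_SIZE, and CHUNK are illustrative (CHUNK is kept a multiple of the 256-thread block size so no out-of-range threads are launched).

    #define FULL_SIZE (1024*1024)
    #define CHUNK     (FULL_SIZE/16)

    void two_stream_add( int *host_a, int *host_b, int *host_result,
                         int *dev_a0, int *dev_b0, int *dev_r0,
                         int *dev_a1, int *dev_b1, int *dev_r1 )
    {
        cudaStream_t stream0, stream1;
        CUDA_SAFE_CALL( cudaStreamCreate( &stream0 ) );
        CUDA_SAFE_CALL( cudaStreamCreate( &stream1 ) );

        for (int i = 0; i < FULL_SIZE; i += 2 * CHUNK) {
            /* chunk i is queued on stream0 */
            CUDA_SAFE_CALL( cudaMemcpyAsync( dev_a0, host_a + i, CHUNK * sizeof(int),
                                             cudaMemcpyHostToDevice, stream0 ) );
            CUDA_SAFE_CALL( cudaMemcpyAsync( dev_b0, host_b + i, CHUNK * sizeof(int),
                                             cudaMemcpyHostToDevice, stream0 ) );
            vectvectadd<<< CHUNK/256, 256, 0, stream0 >>>( dev_a0, dev_b0, dev_r0 );
            CUDA_SAFE_CALL( cudaMemcpyAsync( host_result + i, dev_r0, CHUNK * sizeof(int),
                                             cudaMemcpyDeviceToHost, stream0 ) );

            /* chunk i+CHUNK is queued on stream1; its copies can overlap stream0's kernel */
            CUDA_SAFE_CALL( cudaMemcpyAsync( dev_a1, host_a + i + CHUNK, CHUNK * sizeof(int),
                                             cudaMemcpyHostToDevice, stream1 ) );
            CUDA_SAFE_CALL( cudaMemcpyAsync( dev_b1, host_b + i + CHUNK, CHUNK * sizeof(int),
                                             cudaMemcpyHostToDevice, stream1 ) );
            vectvectadd<<< CHUNK/256, 256, 0, stream1 >>>( dev_a1, dev_b1, dev_r1 );
            CUDA_SAFE_CALL( cudaMemcpyAsync( host_result + i + CHUNK, dev_r1, CHUNK * sizeof(int),
                                             cudaMemcpyDeviceToHost, stream1 ) );
        }

        CUDA_SAFE_CALL( cudaStreamSynchronize( stream0 ) );
        CUDA_SAFE_CALL( cudaStreamSynchronize( stream1 ) );
        CUDA_SAFE_CALL( cudaStreamDestroy( stream0 ) );
        CUDA_SAFE_CALL( cudaStreamDestroy( stream1 ) );
    }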



Centre for Development of Advanced Computing