



hyPACK-2013 GPU Programming using NVIDIA OpenCL - CUDA Architecture

NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on the company's powerful GPUs. CUDA is a fundamentally new computing architecture that enables the GPU to solve complex computational problems, giving computationally intensive applications access to the processing power of NVIDIA graphics processing units (GPUs) through a new programming interface.


Source & References : GPGPU & GPU Computing    

NVIDIA GPU OpenCL

Architecture : The CUDA architecture is close to the OpenCL architecture. A CUDA device is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). A multiprocessor corresponds to an OpenCL compute unit. A multiprocessor executes a CUDA thread for each OpenCL work-item and a thread block for each OpenCL work-group. A kernel is executed over an OpenCL NDRange by a grid of thread blocks. Each thread block that executes the kernel is therefore uniquely identified by its work-group ID, and each thread by its global ID or by a combination of its local ID and work-group ID. A thread is also given a unique thread ID within its block. When an OpenCL program on the host invokes a kernel, the work-groups are enumerated and distributed as thread blocks to the multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
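The ID relationships above can be made concrete with a short kernel. The following is a minimal sketch (an illustration, not part of the hyPACK material): each work-item recomputes its global ID from its work-group ID and local ID, which must match the value returned by get_global_id.

    /* Minimal OpenCL C sketch of the ID mapping: a work-group
       corresponds to a CUDA thread block, a work-item to a CUDA thread. */
    __kernel void show_ids(__global int *out)
    {
        size_t gid   = get_global_id(0);   /* global work-item ID        */
        size_t lid   = get_local_id(0);    /* thread ID within the block */
        size_t group = get_group_id(0);    /* work-group (block) ID      */
        size_t lsz   = get_local_size(0);  /* work-group size            */

        /* global ID == work-group ID * work-group size + local ID */
        out[gid] = (int)(group * lsz + lid);
    }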

Memory Model : Each multiprocessor of the NVIDIA CUDA architecture has on-chip memory of the following four types:

  • One set of local 32-bit registers per processor,

  • A parallel data cache or shared memory that is shared by all scalar processor cores and is where OpenCL local memory resides,

  • A read-only constant cache that is shared by all scalar processor cores and speeds up reads from OpenCL constant memory,

  • A read-only texture cache that is shared by all scalar processor cores and speeds up reads from OpenCL image objects. Each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering specified by OpenCL sampler objects; the region of device memory addressed by images is referred to as texture memory.

There is also a global memory address space that is used for OpenCL global memory and a local memory address space that is private to each thread (and should not be confused with OpenCL local memory). Both memory spaces are read-write regions of device memory and are not cached.
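As a sketch of how these memory spaces appear to the programmer (an illustration, not an SDK sample), OpenCL C exposes them through address-space qualifiers: __global for device memory, __local for the on-chip shared memory, __constant for the cached constant memory, and __private (the default for kernel variables) for per-thread registers.

    __kernel void memory_spaces(__global const float *in,
                                __global float *out,
                                __constant float *coeff,
                                __local float *scratch)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        float x = in[gid];             /* __private: typically a register     */
        scratch[lid] = x * coeff[0];   /* staged in fast on-chip local memory */
        barrier(CLK_LOCAL_MEM_FENCE);  /* synchronize the work-group          */

        out[gid] = scratch[lid];
    }

The __local buffer is sized by the host at clSetKernelArg time (with a NULL argument value), which is how OpenCL local memory is allocated per work-group.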

List of Programs : NVIDIA GPU - CUDA

The matrix multiplication examples illustrate the typical data-parallel approach used by OpenCL applications to achieve good performance on GPUs. They illustrate the use of OpenCL local memory, which maps to shared memory on the CUDA architecture. Shared memory is much faster than global memory, and implementations based on shared memory accesses give a significant performance improvement for typical dense matrix computations. Example programs that take advantage of shared memory are illustrated.
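A tiled kernel in this style is sketched below (an illustration of the technique, not the SDK's exact code; the TILE size and the assumption that the matrix order n is a multiple of TILE are choices made here). Each work-group stages TILE x TILE sub-blocks of A and B in local memory, so each global-memory element is read once per tile rather than once per multiply-add.

    #define TILE 16   /* assumed work-group edge; launch with local size TILE x TILE */

    __kernel void matmul_tiled(__global const float *A,
                               __global const float *B,
                               __global float *C,
                               const int n)    /* n x n matrices, n % TILE == 0 */
    {
        __local float Asub[TILE][TILE];   /* maps to CUDA shared memory */
        __local float Bsub[TILE][TILE];

        int row = get_global_id(1);
        int col = get_global_id(0);
        int ly  = get_local_id(1);
        int lx  = get_local_id(0);

        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {
            /* cooperatively load one tile of A and one tile of B */
            Asub[ly][lx] = A[row * n + t * TILE + lx];
            Bsub[ly][lx] = B[(t * TILE + ly) * n + col];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int k = 0; k < TILE; ++k)
                acc += Asub[ly][k] * Bsub[k][lx];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * n + col] = acc;
    }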

Experts may discuss performance guidelines, focusing on instruction performance, memory bandwidth issues, shared memory, the NDRange and the execution time of a kernel launch in the OpenCL implementation, data transfer between host and device, warp-level synchronization issues, and overall performance optimization strategies.
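One of those measurements, the execution time of a kernel launch, can be taken with OpenCL events. The following host-side sketch assumes the context, device, kernel and NDRange sizes were created elsewhere (hypothetical names); profiling must be enabled when the command queue is created.

    #include <CL/cl.h>

    /* Sketch: time one kernel launch with OpenCL event profiling. */
    double time_kernel(cl_context context, cl_device_id device,
                       cl_kernel kernel, size_t global, size_t local)
    {
        cl_int err;
        cl_command_queue q = clCreateCommandQueue(context, device,
                                                  CL_QUEUE_PROFILING_ENABLE, &err);
        cl_event ev;
        err = clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, &local,
                                     0, NULL, &ev);
        clWaitForEvents(1, &ev);

        cl_ulong t0, t1;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof(t1), &t1, NULL);
        clReleaseEvent(ev);
        clReleaseCommandQueue(q);
        return (t1 - t0) * 1e-9;   /* device timestamps are in nanoseconds */
    }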


NVIDIA GPU OpenCL SDK

OpenCL Device Query Test : This sample enumerates the properties of the OpenCL devices present in the system.
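In the spirit of that sample, a minimal device query might look as follows (a sketch, not the SDK source), enumerating platforms and devices and printing each device's name and compute-unit count:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, platforms, &nplat);

        for (cl_uint p = 0; p < nplat; ++p) {
            cl_device_id devices[8];
            cl_uint ndev = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &ndev);

            for (cl_uint d = 0; d < ndev; ++d) {
                char name[256];
                cl_uint cu = 0;
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                                sizeof(name), name, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                                sizeof(cu), &cu, NULL);
                printf("Device: %s, compute units: %u\n", name, cu);
            }
        }
        return 0;
    }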

OpenCL Bandwidth Test : A simple test program to measure the memcopy bandwidth of the GPU. It is currently capable of measuring device-to-device copy bandwidth, and host-to-device and device-to-host copy bandwidth for pageable and page-locked memory, using memory-mapped and direct access.

OpenCL Vector Addition : Element-by-element addition of two one-dimensional arrays. Implemented in OpenCL for CUDA GPUs, with a functional comparison against a simple C++ host CPU implementation.

OpenCL Dot Product : Dot product (scalar product) of a set of input vector pairs. Implemented in OpenCL for CUDA GPUs, with a functional comparison against a simple C++ host CPU implementation.

OpenCL Matrix Vector Multiplication : A simple matrix-vector multiplication example showing increasingly optimized implementations.

Further samples include OpenCL Simple Multi-GPU, OpenCL Simple Integer op, OpenCL Scan, Parallel Reduction, OpenCL Matrix Transpose, OpenCL Matrix Multiplication, and OpenCL 3D FFT. The other CUDA OpenCL codes can be downloaded from

http://developer.download.nvidia.com/compute/opencl/sdk/
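As an illustration of the Vector Addition sample listed above, its kernel reduces to a few lines (a sketch of the technique, not the SDK's exact code): one work-item per element, with a bounds check in case the NDRange was padded up to a multiple of the work-group size.

    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c,
                             const int n)
    {
        int i = get_global_id(0);
        if (i < n)                 /* guard against NDRange padding */
            c[i] = a[i] + b[i];
    }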

Some of the CUDA OpenCL codes are :

  • OpenCL DCT 8x8

  • OpenCL DirectX Texture Compressor (DXTC)

  • OpenCL Radix Sort

  • OpenCL Sorting Networks

  • OpenCL Black-Scholes Option Pricing

  • OpenCL Quasirandom Generator

  • OpenCL Mersenne Twister

  • OpenCL 64-bin and 256-bin Histogram

  • OpenCL Post-Process OpenGL-Rendered Image

  • Simple Texture 3D

  • OpenCL Box Filter 8x8

  • OpenCL Sobel Filter

  • OpenCL Median Filter

  • OpenCL Separable Convolution

  • OpenCL Recursive Gaussian Filter

  • OpenCL Volume rendering

  • OpenCL Particle Collision Simulation

  • OpenCL N-Body Physics Simulation

Centre for Development of Advanced Computing