NVIDIA GPU OpenCL
Architecture : The CUDA architecture is close to the OpenCL architecture. A CUDA device is built around a
scalable array of multithreaded Streaming Multiprocessors (SMs).
A multiprocessor corresponds to an OpenCL compute unit.
A multiprocessor executes a CUDA thread for each OpenCL work-item and a thread block for
each OpenCL work-group. A kernel is executed over an OpenCL NDRange by a grid of thread blocks.
Each of the thread blocks that execute a kernel is therefore uniquely identified by its work-group ID,
and each thread by its global ID or by a combination of its local ID and work-group ID.
A thread is also given a unique thread ID within its block. When an OpenCL program on the host
invokes a kernel, the work-groups are enumerated and distributed as thread blocks to the multiprocessors
with available execution capacity. The threads of a thread block execute concurrently on one
multiprocessor. As thread blocks terminate, new blocks are launched on the vacated
multiprocessors.
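
As a minimal sketch of this mapping (the kernel name and output buffer are illustrative, not taken from the text), the following OpenCL C kernel retrieves the IDs described above and shows how the global ID relates to the group and local IDs:

    __kernel void copy_ids(__global int *out)
    {
        // Global ID: unique across the whole NDRange (one CUDA thread per work-item).
        size_t gid = get_global_id(0);
        // Local ID: position within the work-group (the thread ID within the CUDA block).
        size_t lid = get_local_id(0);
        // Group ID: identifies the work-group (the CUDA thread block).
        size_t grp = get_group_id(0);
        // With no global offset, the global ID can be recovered from the other two:
        // gid == grp * get_local_size(0) + lid
        out[gid] = (int)(grp * get_local_size(0) + lid);
    }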
Memory Model : Each multiprocessor of the NVIDIA CUDA architecture has on-chip memory of the following four types:
- One set of local 32-bit registers per processor,
- A parallel data cache, or shared memory, that is shared by all scalar processor cores and is where OpenCL local memory resides,
- A read-only constant cache that is shared by all scalar processor cores and speeds up reads from OpenCL constant memory,
- A read-only texture cache that is shared by all scalar processor cores and speeds up reads from OpenCL image objects. Each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering specified by OpenCL sampler objects; the region of device memory addressed through images is referred to as texture memory.
There is also a global memory address space that is used for OpenCL global memory and a local memory address space that is private to each thread (and should not be confused with OpenCL local memory). Both memory spaces are read-write regions of device memory and are not cached.
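
These regions correspond to OpenCL address-space qualifiers. The following kernel sketch (all names are illustrative; the __local buffer is assumed to be sized by the host via clSetKernelArg) touches each region in turn:

    __kernel void address_spaces(__global float *dst,        // device (global) memory
                                 __constant float *coeff,    // read through the constant cache
                                 __local float *scratch,     // on-chip shared memory, per work-group
                                 __read_only image2d_t img,  // texture memory, via the texture cache
                                 sampler_t smp)              // addressing/filtering modes
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        float tmp = coeff[0];              // tmp lives in per-thread private memory
        scratch[lid] = tmp;
        barrier(CLK_LOCAL_MEM_FENCE);      // make the __local store visible to the work-group
        float4 texel = read_imagef(img, smp, (int2)((int)gid, 0));
        dst[gid] = scratch[lid] + texel.x;
    }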
List of Programs : NVIDIA GPU - CUDA
The matrix multiplication examples illustrate the typical data-parallel approach used by OpenCL
applications to achieve good performance on GPUs.
They illustrate the use of OpenCL local memory, which maps to shared memory on the CUDA architecture.
Shared memory is much faster than global memory, and implementations based on shared-memory
accesses improve performance for typical dense matrix computations.
An example program that takes advantage of shared memory is sketched below.
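
The following is a sketch of such a tiled kernel, assuming square n x n matrices with n a multiple of the tile size (the kernel and variable names are illustrative, not taken from a specific listing). Each work-group stages TILE x TILE sub-matrices of A and B into local (shared) memory, so every element fetched from global memory is reused TILE times:

    #define TILE 16

    // C = A * B for square n x n matrices; launch with a TILE x TILE local size.
    __kernel void matmul_tiled(__global const float *A,
                               __global const float *B,
                               __global float *C,
                               const int n)
    {
        __local float Asub[TILE][TILE];   // tiles staged in OpenCL local memory,
        __local float Bsub[TILE][TILE];   // which maps to CUDA shared memory

        int row  = get_global_id(1);
        int col  = get_global_id(0);
        int lrow = get_local_id(1);
        int lcol = get_local_id(0);

        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {
            // Each work-item loads one element of each tile from global memory.
            Asub[lrow][lcol] = A[row * n + t * TILE + lcol];
            Bsub[lrow][lcol] = B[(t * TILE + lrow) * n + col];
            barrier(CLK_LOCAL_MEM_FENCE);   // wait until the whole tile is loaded

            for (int k = 0; k < TILE; ++k)
                acc += Asub[lrow][k] * Bsub[k][lcol];
            barrier(CLK_LOCAL_MEM_FENCE);   // wait before overwriting the tiles
        }
        C[row * n + col] = acc;
    }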
Experts may discuss performance guidelines, focusing on instruction performance, memory-bandwidth issues,
shared memory, the NDRange and the execution time of a kernel launch on the OpenCL implementation,
data transfer between host and device (see the timing sketch below), warp-level synchronization issues,
and overall performance-optimization strategies.
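
For the kernel-launch timing point, one common approach is to use OpenCL profiling events. The following is a minimal sketch, assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and an already-built kernel; names and the 2-D launch shape are illustrative:

    #include <CL/cl.h>

    // Enqueue a prebuilt 2-D kernel and return its device execution time in ms.
    double time_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                          const size_t global[2], const size_t local[2])
    {
        cl_event evt;
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
        clWaitForEvents(1, &evt);

        cl_ulong t0, t1;   // device timestamps, in nanoseconds
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
        clReleaseEvent(evt);
        return (t1 - t0) * 1e-6;
    }

The same event mechanism applies to clEnqueueWriteBuffer and clEnqueueReadBuffer, so host-device transfer time can be measured in the same way.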