NVIDIA GPU OpenCL
Architecture : The CUDA architecture is close to the OpenCL architecture. A CUDA device is built around a
scalable array of multithreaded Streaming Multiprocessors (SMs).
A multiprocessor corresponds to an OpenCL compute unit.
A multiprocessor executes a CUDA thread for each OpenCL work-item and a thread block for
each OpenCL work-group. A kernel is executed over an OpenCL NDRange by a grid of thread blocks.
Each of the thread blocks that execute a kernel is therefore uniquely identified by its work-group ID,
and each thread by its global ID or by a combination of its local ID and work-group ID.
A thread is also given a unique thread ID within its block. When an OpenCL program on the host
invokes a kernel, the work-groups are enumerated and distributed as thread blocks to the multiprocessors
with available execution capacity. The threads of a thread block execute concurrently on one
multiprocessor. As thread blocks terminate, new blocks are launched on the vacated
multiprocessors.
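
As a minimal sketch of this mapping (the kernel name and output buffer are illustrative, not taken from the text), the following OpenCL C kernel retrieves the IDs described above and shows how the global ID relates to the group and local IDs:

    __kernel void copy_ids(__global int *out)
    {
        // Global ID: unique across the whole NDRange (one CUDA thread per work-item).
        size_t gid = get_global_id(0);
        // Local ID: position within the work-group (the thread ID within the CUDA block).
        size_t lid = get_local_id(0);
        // Group ID: identifies the work-group (the CUDA thread block).
        size_t grp = get_group_id(0);
        // With no global offset, the global ID can be recovered from the other two:
        // gid == grp * get_local_size(0) + lid
        out[gid] = (int)(grp * get_local_size(0) + lid);
    }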
Memory Model : Each multiprocessor of the NVIDIA CUDA architecture has on-chip memory of the following four types:
- One set of local 32-bit registers per processor,
- A parallel data cache, or shared memory, that is shared by all scalar processor cores and is where OpenCL local memory resides,
- A read-only constant cache that is shared by all scalar processor cores and speeds up reads from OpenCL constant memory,
- A read-only texture cache that is shared by all scalar processor cores and speeds up reads from OpenCL image objects. Each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering specified by OpenCL sampler objects; the region of device memory addressed through images is referred to as texture memory.
There is also a global memory address space that is used for OpenCL global memory and a local memory address space that is private to each thread (and should not be confused with OpenCL local memory). Both memory spaces are read-write regions of device memory and are not cached.
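
These regions correspond to OpenCL address-space qualifiers. The following kernel sketch (all names are illustrative; the __local buffer is assumed to be sized by the host via clSetKernelArg) touches each region in turn:

    __kernel void address_spaces(__global float *dst,        // device (global) memory
                                 __constant float *coeff,    // read through the constant cache
                                 __local float *scratch,     // on-chip shared memory, per work-group
                                 __read_only image2d_t img,  // texture memory, via the texture cache
                                 sampler_t smp)              // addressing/filtering modes
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        float tmp = coeff[0];              // tmp lives in per-thread private memory
        scratch[lid] = tmp;
        barrier(CLK_LOCAL_MEM_FENCE);      // make the __local store visible to the work-group
        float4 texel = read_imagef(img, smp, (int2)((int)gid, 0));
        dst[gid] = scratch[lid] + texel.x;
    }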
List of Programs : NVIDIA GPU - CUDA
The matrix multiplication examples illustrate the typical data-parallel approach used by OpenCL
applications to achieve good performance on GPUs.
They illustrate the use of OpenCL local memory, which maps to shared memory on the CUDA architecture.
Shared memory is much faster than global memory, and implementations based on shared-memory
accesses improve performance for typical dense matrix computations.
An example program that takes advantage of shared memory is sketched below.
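
The following is a sketch of such a tiled kernel, assuming square n x n matrices with n a multiple of the tile size (the kernel and variable names are illustrative, not taken from a specific listing). Each work-group stages TILE x TILE sub-matrices of A and B into local (shared) memory, so every element fetched from global memory is reused TILE times:

    #define TILE 16

    // C = A * B for square n x n matrices; launch with a TILE x TILE local size.
    __kernel void matmul_tiled(__global const float *A,
                               __global const float *B,
                               __global float *C,
                               const int n)
    {
        __local float Asub[TILE][TILE];   // tiles staged in OpenCL local memory,
        __local float Bsub[TILE][TILE];   // which maps to CUDA shared memory

        int row  = get_global_id(1);
        int col  = get_global_id(0);
        int lrow = get_local_id(1);
        int lcol = get_local_id(0);

        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {
            // Each work-item loads one element of each tile from global memory.
            Asub[lrow][lcol] = A[row * n + t * TILE + lcol];
            Bsub[lrow][lcol] = B[(t * TILE + lrow) * n + col];
            barrier(CLK_LOCAL_MEM_FENCE);   // wait until the whole tile is loaded

            for (int k = 0; k < TILE; ++k)
                acc += Asub[lrow][k] * Bsub[k][lcol];
            barrier(CLK_LOCAL_MEM_FENCE);   // wait before overwriting the tiles
        }
        C[row * n + col] = acc;
    }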
Experts may discuss performance guidelines, focusing on instruction performance, memory-bandwidth issues,
shared memory, the NDRange and the execution time of a kernel launch on the OpenCL implementation,
data transfer between host and device (see the timing sketch below), warp-level synchronization issues,
and overall performance-optimization strategies.
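
For the kernel-launch timing point, one common approach is to use OpenCL profiling events. The following is a minimal sketch, assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and an already-built kernel; names and the 2-D launch shape are illustrative:

    #include <CL/cl.h>

    // Enqueue a prebuilt 2-D kernel and return its device execution time in ms.
    double time_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                          const size_t global[2], const size_t local[2])
    {
        cl_event evt;
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
        clWaitForEvents(1, &evt);

        cl_ulong t0, t1;   // device timestamps, in nanoseconds
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
        clReleaseEvent(evt);
        return (t1 - t0) * 1e-6;
    }

The same event mechanism applies to clEnqueueWriteBuffer and clEnqueueReadBuffer, so host-device transfer time can be measured in the same way.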