



Programming on Heterogeneous Computing Platforms : OpenCL

The Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the heterogeneous platform. OpenCL gives developers an opportunity to use the multiple heterogeneous compute resources of their CPUs, GPUs and other processors effectively. Developers want to be able to access these resources together, not just offload specific tasks to the GPU, and the OpenCL programming environment addresses this need. OpenCL supports a wide range of applications, from embedded and consumer software to HPC solutions, through a low-level, high-performance, portable abstraction. It is expected that OpenCL will form the foundation layer of a parallel programming ecosystem of platform-independent tools, middleware, and applications.
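Most of the exercises below share the same host-side setup. The following minimal sketch (error handling omitted, identifiers chosen for illustration, not taken from any particular exercise) shows the typical sequence: select a platform and device, create a context and command queue, build a kernel from source, create a buffer, launch the kernel over an NDRange, and read the result back.

    /* Minimal OpenCL host skeleton; error handling is abbreviated. */
    #include <CL/cl.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void scale(__global float *a, float f) {"
        "    size_t i = get_global_id(0);"
        "    a[i] = a[i] * f;"
        "}";

    int main(void)
    {
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        float data[1024]; for (int i = 0; i < 1024; ++i) data[i] = (float)i;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);
        float f = 2.0f;
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(k, 1, sizeof(float), &f);

        size_t global = 1024;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
        printf("data[10] = %f\n", data[10]);

        clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
        return 0;
    }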


List of Programs

Example 1.1
Write an OpenCL program to measure the time taken, for different data sizes, to copy (write) data from pinned host memory to device memory using clEnqueueWriteBuffer()
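A minimal sketch of the measurement for one data size follows (repeat over sizes in the full exercise). It assumes ctx and q were created as in the skeleton above; gettimeofday() is just one possible host timer. Pinned host memory is obtained here by allocating a buffer with CL_MEM_ALLOC_HOST_PTR and mapping it.

    #include <CL/cl.h>
    #include <string.h>
    #include <sys/time.h>   /* gettimeofday(); any host timer would do */

    double time_h2d_copy(cl_context ctx, cl_command_queue q, size_t bytes)
    {
        /* Pinned (page-locked) host buffer, exposed to the host by mapping it. */
        cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                       bytes, NULL, NULL);
        void *host_ptr = clEnqueueMapBuffer(q, pinned, CL_TRUE, CL_MAP_WRITE,
                                            0, bytes, 0, NULL, NULL, NULL);
        memset(host_ptr, 1, bytes);
        cl_mem dev = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        clEnqueueWriteBuffer(q, dev, CL_TRUE, 0, bytes, host_ptr, 0, NULL, NULL);
        gettimeofday(&t1, NULL);
        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;

        clEnqueueUnmapMemObject(q, pinned, host_ptr, 0, NULL, NULL);
        clReleaseMemObject(dev);
        clReleaseMemObject(pinned);
        return sec;   /* bandwidth in GB/s = bytes / sec / 1e9 */
    }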
Example 1.2
Measure the time taken by OpenCL calls (blocking or non-blocking) and kernel executions using either CPU timers or GPU timers (OpenCL events with profiling enabled)
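A sketch of the GPU-timer (event) approach is shown below; ctx, device, kernel, global and local are assumed to exist. The queue must be created with CL_QUEUE_PROFILING_ENABLE for clGetEventProfilingInfo() to return timestamps.

    cl_command_queue q = clCreateCommandQueue(ctx, device,
                                              CL_QUEUE_PROFILING_ENABLE, NULL);
    cl_event evt;
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, &local, 0, NULL, &evt);
    clWaitForEvents(1, &evt);          /* or clFinish(q) */

    cl_ulong start, end;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    double ms = (end - start) * 1e-6;  /* profiling counters are in nanoseconds */
    clReleaseEvent(evt);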
Example 1.3
Code to measure the effective bandwidth for a 1024x1024 matrix (single/double precision) using the FastPath and CompletePath memory paths, and to measure the ratio to peak bandwidth
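The effective-bandwidth calculation itself is a small piece of host arithmetic; the sketch below assumes seconds (measured kernel time) and peak_gbps (from the device's specifications) are already available.

    /* Effective bandwidth for a 1024x1024 single-precision matrix copy:
       count only the bytes the algorithm must read and write. */
    double bytes          = 1024.0 * 1024.0 * sizeof(float) * 2.0;  /* read + write */
    double effective_gbps = bytes / (seconds * 1.0e9);
    double ratio_to_peak  = effective_gbps / peak_gbps;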
Example 1.4
Write a program to calculate memory throughput using the OpenCL Visual Profiler
Example 1.5
Analyze the differences between the calculated effective memory bandwidth and the memory throughput reported by the OpenCL Visual Profiler
Example 1.6
Write a code to analyze the performance of highly data-parallel computations, such as matrix-matrix computations, on GPUs in which each multiprocessor contains either 8,192 or 16,384 32-bit registers that are partitioned among the concurrent threads.
Example 1.7
OpenCL program to measure the highest bandwidth between host and device based on page-locked memory
Example 1.8
OpenCL program to measure the highest bandwidth between host and device based on page-locked memory using blocking (synchronous transfer) clEnqueueReadBuffer() / clEnqueueWriteBuffer() calls
Example 1.9
OpenCL program to measure the highest bandwidth between host and device based on page-locked memory using a non-blocking write (asynchronous transfer) clEnqueueWriteBuffer() call with the parameter CL_FALSE and a blocking read from device to host using a clEnqueueReadBuffer() call with the parameter CL_TRUE
Example 1.10
OpenCL program to measure the highest bandwidth between host and device based on pinned memory using a non-blocking write (asynchronous transfer) clEnqueueWriteBuffer() call with the parameter CL_FALSE and a blocking read from device to host using a clEnqueueReadBuffer() call with the parameter CL_TRUE
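A sketch of the transfer pattern used in Examples 1.9/1.10 follows; d_in, d_out, h_in, h_out, kernel, global and bytes are assumed to exist. The non-blocking write (CL_FALSE) returns immediately, the kernel is made to wait on it via an event, and the blocking read (CL_TRUE) acts as the final synchronization point.

    cl_event write_evt;
    clEnqueueWriteBuffer(q, d_in, CL_FALSE, 0, bytes, h_in, 0, NULL, &write_evt);
    /* ... the host is free to do other work here while the copy proceeds ... */
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 1, &write_evt, NULL);
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, bytes, h_out, 0, NULL, NULL);
    clReleaseEvent(write_evt);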
Example 1.11
OpenCL program on overlapping transfers and device computation, based on the SDK sample oclCopyComputeOverlap. The SDK sample "oclCopyComputeOverlap" is devoted exclusively to the techniques required to achieve concurrent copy and device computation, and demonstrates the advantages of this relative to purely synchronous operations.
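The sketch below illustrates only the basic idea (the SDK sample is the authoritative treatment): two in-order queues on the same device let the second half of the input be transferred while a kernel works on the first half, with events preserving the write-then-compute order per half. Whether overlap actually occurs depends on the device and driver; kernel_lo/kernel_hi, d_in, h_in, bytes and n are assumed to exist, with the two kernels set up for the two halves.

    cl_command_queue q0 = clCreateCommandQueue(ctx, device, 0, NULL);
    cl_command_queue q1 = clCreateCommandQueue(ctx, device, 0, NULL);
    size_t half_bytes = bytes / 2, half_items = n / 2;
    cl_event w0, w1;

    clEnqueueWriteBuffer(q0, d_in, CL_FALSE, 0, half_bytes,
                         h_in, 0, NULL, &w0);
    clEnqueueWriteBuffer(q1, d_in, CL_FALSE, half_bytes, half_bytes,
                         (char *)h_in + half_bytes, 0, NULL, &w1);
    clEnqueueNDRangeKernel(q0, kernel_lo, 1, NULL, &half_items, NULL, 1, &w0, NULL);
    clEnqueueNDRangeKernel(q1, kernel_hi, 1, NULL, &half_items, NULL, 1, &w1, NULL);
    clFinish(q0);
    clFinish(q1);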
Example 1.12
Write test suites focusing on performance and scalability analysis for different data sizes in the different memory spaces, i.e. the 16 KB per-thread limit on local memory (OpenCL __private memory), 64 KB of constant memory (OpenCL __constant memory), 16 KB of shared memory (OpenCL __local memory), and either 8,192 or 16,384 32-bit registers per multiprocessor.
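The figures quoted above vary between hardware generations, so a test suite would normally query the per-device limits at run time; a short sketch (device assumed to be already selected):

    cl_ulong local_mem, const_mem;
    size_t   max_wg;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(const_mem), &const_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg), &max_wg, NULL);
    printf("local: %lu B, constant: %lu B, max work-group size: %lu\n",
           (unsigned long)local_mem, (unsigned long)const_mem,
           (unsigned long)max_wg);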
Example 1.13
Write a code showing how to use the default work-group size at compile time, how to choose the size of the work-group, the role of the compiler in allocating the number of registers per work-item, and how to keep enough wavefronts in flight.
Example 1.14
OpenCL program for matrix-matrix computation with different partitions of the matrices for coalescing global memory accesses: (a) a simple access pattern - the kth thread accesses the kth word in a segment; the exception is that not all threads need to participate; (b) a sequential but misaligned access pattern (sequential threads in a half-warp access memory that is not aligned with the segments); (c) effects of misaligned accesses; (d) strided accesses
Example 1.15
Simple OpenCL program that computes a matrix-vector multiply, choosing the best-optimized NDRange so that an optimized number of work-items is launched. Setting the right work sizes in clEnqueueNDRangeKernel(), with the number of work-items per work-group a multiple of the warp size (i.e. 32), can be explored.
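The launch-configuration part of this exercise amounts to a few lines of host code; the sketch below assumes q, kernel and the problem size n already exist, and that the kernel bounds-checks its global id against n.

    size_t local  = 64;                               /* a multiple of 32 */
    size_t global = ((n + local - 1) / local) * local; /* round up to a multiple */
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, &local, 0, NULL, NULL);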
Example 1.16
Simple OpenCL program that computes the product W of a width x height matrix M by a vector V, in which global memory accesses are coalesced and the kernel is rewritten to have each work-group, as opposed to each work-item, compute elements of W. (Each work-item is now responsible for calculating part of the dot product of V and a row of M and storing it to OpenCL local memory.)
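A sketch of such a kernel is given below: one work-group per row of M, each work-item accumulating part of the dot product into local memory, followed by a local reduction. The local buffer is assumed to be passed from the host with clSetKernelArg(kernel, 4, local_size * sizeof(float), NULL), and the work-group size is assumed to be a power of two.

    __kernel void matvec(__global const float *M, __global const float *V,
                         __global float *W, uint width,
                         __local float *partial)
    {
        uint row = get_group_id(0);
        uint lid = get_local_id(0);

        float sum = 0.0f;
        for (uint col = lid; col < width; col += get_local_size(0))
            sum += M[row * width + col] * V[col];   /* coalesced across work-items */
        partial[lid] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Reduce the per-work-item partial sums held in local memory. */
        for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                partial[lid] += partial[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            W[row] = partial[0];
    }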
Example 1.17
OpenCL parallel reduction (a) with shared (local) memory, examining the effects of misaligned accesses and bank conflicts, (b) without shared memory, and (c) with warp-based parallelism
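The bank-conflict point in (a) comes down to how the reduction stage indexes local memory. The fragment below sketches the two alternatives (use one or the other, not both); it assumes __local float *scratch already holds one value per work-item, lid = get_local_id(0), and a power-of-two work-group size.

    /* Strided (interleaved) addressing: the 2*s stride makes several active
       work-items hit the same local-memory bank -> bank conflicts. */
    for (uint s = 1; s < get_local_size(0); s *= 2) {
        uint idx = 2 * s * lid;
        if (idx + s < get_local_size(0))
            scratch[idx] += scratch[idx + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Sequential addressing: the active lower half touches contiguous words,
       which is conflict-free. */
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }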
Example 1.18
OpenCL code for the matrix-matrix multiply C = AA^T based on (a) strided accesses to global memory and (b) shared (local) memory bank conflicts (an optimized version using coalesced reads from global memory)
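A sketch of the optimized variant follows: TILE x TILE blocks of A are staged in local memory so that the "transposed" operand is read from global memory with unit stride, and the second tile is padded by one column to avoid local-memory bank conflicts. It assumes a square w x w matrix with w a multiple of TILE and a TILE x TILE work-group over a w x w global range.

    #define TILE 16
    __kernel void aat(__global const float *A, __global float *C, uint w)
    {
        __local float tileA[TILE][TILE];
        __local float tileB[TILE][TILE + 1];   /* +1 pad avoids bank conflicts */
        uint row = get_global_id(1), col = get_global_id(0);
        uint lx = get_local_id(0),  ly = get_local_id(1);

        float sum = 0.0f;
        for (uint t = 0; t < w; t += TILE) {
            /* Both loads read consecutive words across lx -> coalesced. */
            tileA[ly][lx] = A[row * w + t + lx];
            tileB[ly][lx] = A[(get_group_id(0) * TILE + ly) * w + t + lx];
            barrier(CLK_LOCAL_MEM_FENCE);
            for (uint k = 0; k < TILE; ++k)
                sum += tileA[ly][k] * tileB[lx][k];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * w + col] = sum;   /* C[row][col] = dot(A[row][:], A[col][:]) */
    }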
Example 1.19
AMD-APP (using CAL) : Write a simple HelloCAL application program using the CAL API of AMD Accelerated Parallel Processing technology
Example 1.20
AMD-APP : Write a direct memory access (DMA) code to move data between system memory and GPU local memory using calMemCopy of AMD APP
Example 1.21
AMD-APP : Write a code to use AMD-APP asynchronous operations through the CAL API, for an application that must perform CPU computations in the application thread while also running a kernel on the AMD-APP stream processor
Example 1.22
AMD-APP (multiple devices) : Write a matrix-vector multiplication based on a self-scheduling algorithm using the AMD-APP CAL API on multiple stream processors (enumerated with calDeviceGetCount), ensuring that application-created CPU threads are used to manage the communication with the individual stream processors.





Centre for Development of Advanced Computing