



hyPACK-2013 Mode-2 : GPGPU OpenCL Prog. using AMD-APP

AMD Accelerated Parallel Processing (AMD APP) software harnesses the processing power of GPUs for high-performance, data-parallel computing in a wide range of applications. The AMD Accelerated Parallel Processing system includes a software stack and AMD GPUs. The AMD APP SDK, available for x86-based CPUs, provides a complete heterogeneous OpenCL development platform for both the CPU and the GPU. The software includes the OpenCL compiler and runtime, the device driver for the GPU compute device (the AMD Compute Abstraction Layer, CAL), performance profiling tools (AMD APP Profiler and AMD APP KernelAnalyzer), and performance libraries (AMD Core Math Library, ACML). Please refer to the AMD APP Programming Guide to understand more about OpenCL and the AMD APP components.


List of Programs

Example 1.1

Write an OpenCL program to measure the time taken, for different data sizes, to copy (write) data from pinned host memory to device memory using clEnqueueWriteBuffer().
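A minimal host-side sketch of this measurement, assuming a single platform with one GPU device and using gettimeofday() for CPU-side timing (error checking omitted for brevity):

/* Sketch: time clEnqueueWriteBuffer() from pinned host memory to a
 * device buffer for several data sizes. Assumes one platform and one
 * GPU device; compile with -lOpenCL.                                 */
#include <CL/cl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    for (size_t bytes = 1 << 20; bytes <= (size_t)1 << 26; bytes <<= 1) {
        /* Pinned (page-locked) host buffer: ALLOC_HOST_PTR + map     */
        cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, NULL);
        void *host = clEnqueueMapBuffer(q, pinned, CL_TRUE, CL_MAP_WRITE,
                                        0, bytes, 0, NULL, NULL, NULL);
        memset(host, 1, bytes);
        cl_mem devbuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        clEnqueueWriteBuffer(q, devbuf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
        clFinish(q);
        gettimeofday(&t1, NULL);

        double sec = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
        printf("%8zu KB : %8.3f ms  (%6.2f GB/s)\n",
               bytes >> 10, sec * 1e3, bytes / sec / 1e9);

        clEnqueueUnmapMemObject(q, pinned, host, 0, NULL, NULL);
        clReleaseMemObject(devbuf);
        clReleaseMemObject(pinned);
    }
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}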

Example 1.2


Measure the time for OpenCL memory transfers (blocking or non-blocking calls) and kernel executions using either CPU timers or GPU timers (OpenCL events).
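A hedged sketch of GPU-side timing with OpenCL events; it assumes an existing context ctx, device dev, and built kernel kern (illustrative names), and requires a queue created with CL_QUEUE_PROFILING_ENABLE:

/* Sketch: time a kernel with OpenCL event profiling.
 * Assumes ctx, dev, and a built cl_kernel kern already exist.        */
cl_command_queue q =
    clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, NULL);

cl_event evt;
size_t global = 1024 * 1024;               /* illustrative NDRange    */
clEnqueueNDRangeKernel(q, kern, 1, NULL, &global, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);                  /* kernel has finished     */

cl_ulong start, end;                       /* device timestamps in ns */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);
printf("kernel time : %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(evt);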

Example 1.3


Code to measure the effective bandwidth for a 1024x1024 matrix (single/double precision) using the FastPath and the CompletePath, and to report the ratio to the peak bandwidth.
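Effective bandwidth is conventionally computed as (bytes read + bytes written) divided by the elapsed time; a small helper to that effect (illustrative):

/* Effective bandwidth in GB/s: (bytes read + bytes written) / time.
 * For a 1024x1024 single-precision copy, bytes_read = bytes_written
 * = 1024 * 1024 * sizeof(float).                                     */
double effective_bw_gbs(double bytes_read, double bytes_written, double sec)
{
    return (bytes_read + bytes_written) / sec / 1e9;
}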

Example 1.4


Write a program to calculate memory throughput using the OpenCL Visual Profiler.

Example 1.5


Analyze the differences between the computed effective memory bandwidth and the memory throughput reported by the OpenCL Visual Profiler.


Example 1.6


Write code to analyze the performance of a highly data-parallel computation, such as matrix-matrix multiplication, on GPUs in which each multiprocessor contains either 8,192 or 16,384 32-bit registers that are partitioned among the concurrent threads.
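A naive kernel that can serve as a starting point for this study (a sketch; one work-item per element of C, square N x N matrices assumed):

/* Naive matrix-matrix multiply kernel, C = A * B; one work-item
 * computes one element of C. An illustrative starting point for
 * studying register and occupancy trade-offs.                        */
__kernel void matmul(__global const float *A,
                     __global const float *B,
                     __global float *C,
                     const int N)
{
    int col = get_global_id(0);
    int row = get_global_id(1);
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
}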

Example 1.7


OpenCL program to measure the highest bandwidth between the host and the device using page-locked (pinned) host memory.

Example 1.8


OpenCL program to measure the highest bandwidth between the host and the device using page-locked memory and blocking (synchronous) clEnqueueReadBuffer() / clEnqueueWriteBuffer() calls.

Example 1.9


OpenCL program to measure the highest bandwidth between the host and the device using page-locked memory, with a non-blocking write (asynchronous transfer) via clEnqueueWriteBuffer() called with CL_FALSE and a blocking read from device to host via clEnqueueReadBuffer() called with CL_TRUE.

Example 1.10


OpenCL program to measure the highest bandwidth between the host and the device using pinned memory, with a non-blocking write (asynchronous transfer) via clEnqueueWriteBuffer() called with CL_FALSE and a blocking read from device to host via clEnqueueReadBuffer() called with CL_TRUE. A sketch of this transfer pattern, which also applies to Examples 1.7-1.9, is given below.
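A hedged host-side sketch of the pattern shared by Examples 1.7-1.10: the pinned buffer is obtained with CL_MEM_ALLOC_HOST_PTR plus clEnqueueMapBuffer(), the write is non-blocking, and the read back is blocking (ctx and q are assumed to exist):

/* Sketch for Examples 1.7-1.10: pinned host memory, non-blocking
 * write (CL_FALSE) to the device, blocking read (CL_TRUE) back.      */
size_t bytes = 32 << 20;                               /* 32 MB       */
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, NULL);
float *host = (float *) clEnqueueMapBuffer(q, pinned, CL_TRUE,
                  CL_MAP_READ | CL_MAP_WRITE, 0, bytes, 0, NULL, NULL, NULL);
cl_mem devbuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

cl_event write_evt;
/* Non-blocking write: returns immediately, transfer proceeds async.  */
clEnqueueWriteBuffer(q, devbuf, CL_FALSE, 0, bytes, host, 0, NULL, &write_evt);
/* ... the host could do useful CPU work here ...                     */
clWaitForEvents(1, &write_evt);

/* Blocking read: does not return until the data is back in host.     */
clEnqueueReadBuffer(q, devbuf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);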


Example 1.11


OpenCL program on overlapping transfers and device computation, based on the SDK sample "oclCopyComputeOverlap". The sample is devoted exclusively to the techniques required to achieve concurrent copy and device computation, and demonstrates the advantages of this approach over purely synchronous operation.
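One common way to realise the overlap is to split the data in two halves and issue each half's copy and kernel launch on its own command queue; a hedged sketch (ctx, dev, a built kernel kern taking the buffer as its only argument, pinned host pointers hostA[0..1], device buffers devA[0..1], bytes and nElems are all assumed to exist):

/* Sketch of copy/compute overlap: while the kernel works on half 0
 * in queue q0, the transfer of half 1 proceeds in queue q1.          */
cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, NULL);
cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, NULL);

cl_event copied[2];
size_t half = bytes / 2, gsize = nElems / 2;

/* Stage both copies asynchronously, one per queue.                   */
clEnqueueWriteBuffer(q0, devA[0], CL_FALSE, 0, half, hostA[0], 0, NULL, &copied[0]);
clEnqueueWriteBuffer(q1, devA[1], CL_FALSE, 0, half, hostA[1], 0, NULL, &copied[1]);

/* Each kernel launch waits only for its own copy, so the copy of one
 * half can overlap with the computation on the other half.           */
for (int i = 0; i < 2; ++i) {
    cl_command_queue q = (i == 0) ? q0 : q1;
    clSetKernelArg(kern, 0, sizeof(cl_mem), &devA[i]);
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &gsize, NULL, 1, &copied[i], NULL);
}
clFinish(q0); clFinish(q1);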

Example 1.12


Write test suites focusing on performance and scalability analysis for different data sizes in the different memory spaces, i.e. the 16 KB per-thread limit on local memory (OpenCL __private memory), 64 KB of constant memory (OpenCL __constant memory), 16 KB of shared memory per multiprocessor (OpenCL __local memory), and either 8,192 or 16,384 32-bit registers per multiprocessor.
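An illustrative kernel that touches each of the OpenCL address spaces mentioned above (a sketch only; sizes and names are arbitrary):

/* Illustrative kernel touching each OpenCL address space:
 * __constant for read-only coefficients, __local for a per-work-group
 * scratch tile, and __private (the default) for per-work-item
 * temporaries held in registers.                                      */
__kernel void memory_spaces(__global const float *in,
                            __global float *out,
                            __constant float *coeff,     /* constant memory */
                            __local float *tile,         /* local (shared)  */
                            const int n)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float priv = 0.0f;                    /* private memory / registers    */

    if (gid < n) tile[lid] = in[gid];     /* stage data in local memory    */
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid < n) {
        priv = tile[lid] * coeff[0];
        out[gid] = priv;
    }
}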

Example 1.13


Write code showing how to fix a default work-group size at compile time, the effect of the work-group size, the role of the compiler in allocating the number of registers per work-item, and how to keep enough wavefronts in flight.
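A minimal sketch of fixing the work-group size at compile time with the reqd_work_group_size kernel attribute (64 work-items, i.e. one AMD wavefront, chosen for illustration):

/* Fixing the work-group size at compile time lets the compiler tailor
 * register allocation to exactly that many work-items.               */
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void scale(__global float *data, const float alpha, const int n)
{
    int gid = get_global_id(0);
    if (gid < n)
        data[gid] *= alpha;
}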

Example 1.14


OpenCL program for matrix-matrix computation with different partitionings of the matrices to study coalescing of global memory accesses: (a) a simple access pattern, where the k-th thread accesses the k-th word in a segment (with the exception that not all threads need to participate); (b) a sequential but misaligned access pattern, where sequential threads in a half-warp access memory that is not aligned with the segments; (c) the effects of misaligned accesses; (d) strided accesses.
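Illustrative copy kernels for the access patterns above (a sketch; the buffers are assumed to be padded for the largest offset and stride, and timing the kernels exposes the bandwidth penalty of misaligned and strided accesses):

/* Coalesced, misaligned (constant offset), and strided copies.       */
__kernel void copy_coalesced(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i];                    /* k-th work-item -> k-th word  */
}

__kernel void copy_offset(__global const float *in, __global float *out,
                          const int offset)
{
    int i = get_global_id(0) + offset; /* sequential but misaligned    */
    out[i] = in[i];
}

__kernel void copy_strided(__global const float *in, __global float *out,
                           const int stride)
{
    int i = get_global_id(0) * stride; /* strided access               */
    out[i] = in[i];
}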

Example 1.15


Simple OpenCL program that computes a matrix-vector multiply, choosing the best-optimized NDRange so that an optimal number of work-items is launched. Setting the right work sizes in clEnqueueNDRangeKernel(), with the number of work-items a multiple of the warp size (32) or wavefront size (64), can be explored.
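A small host-side sketch of rounding the global size up to a multiple of the chosen local size before launching (q, kern and the problem size n are assumed to exist; the kernel guards the padded work-items with an if (gid < n) test):

/* Choose a local size that is a multiple of the wavefront (64 on AMD
 * GPUs) and round the global size up to a multiple of it.            */
size_t local  = 64;                                   /* one wavefront */
size_t global = ((n + local - 1) / local) * local;    /* round up      */
clEnqueueNDRangeKernel(q, kern, 1, NULL, &global, &local, 0, NULL, NULL);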


Example 1.16


Simple OpenCL program that computes the product W of a width x height matrix M and a vector V, in which global memory accesses are coalesced; the kernel must be rewritten so that each work-group, as opposed to each work-item, computes elements of W. (Each work-item is now responsible for calculating part of the dot product of V and a row of M and storing it in OpenCL local memory.)
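A hedged kernel sketch of this scheme: one work-group per row of M, each work-item accumulating a partial dot product in local memory, followed by a tree reduction (the local size is assumed to be a power of two):

/* One work-group per row of M. Accesses to M are coalesced because
 * consecutive work-items read consecutive columns.                   */
__kernel void matvec_rowgroup(__global const float *M,
                              __global const float *V,
                              __global float *W,
                              __local  float *partial,
                              const int width)
{
    int row = get_group_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    float sum = 0.0f;
    for (int col = lid; col < width; col += lsz)
        sum += M[row * width + col] * V[col];
    partial[lid] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int s = lsz / 2; s > 0; s >>= 1) {     /* tree reduction      */
        if (lid < s) partial[lid] += partial[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) W[row] = partial[0];
}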

Example 1.17


OpenCL parallel reduction: (a) with shared (local) memory, examining the effects of misaligned accesses and bank conflicts; (b) without shared memory; (c) warp-based parallelism.
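A sketch of variant (a): a local-memory reduction with sequential addressing, which keeps the active work-items contiguous and avoids local-memory bank conflicts (each work-group emits one partial sum; a second pass or the host adds them up; the local size is assumed to be a power of two):

__kernel void reduce_local(__global const float *in,
                           __global float *groupSums,
                           __local  float *scratch,
                           const int n)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    scratch[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Sequential addressing: the active work-items stay contiguous,
     * so each step is free of bank conflicts.                         */
    for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s) scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) groupSums[get_group_id(0)] = scratch[0];
}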

Example 1.18


OpenCL code for the matrix-matrix multiply C = A*A^T based on (a) strided accesses to global memory, (b) shared (local) memory with bank conflicts, and (c) an optimized version using coalesced reads from global memory.
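A hedged sketch of the coalesced variant (c); N is assumed to be a multiple of the tile width and the NDRange to be N x N. The second tile is stored transposed and padded by one column to reduce local-memory bank conflicts:

/* C[row][col] = sum_k A[row][k] * A[col][k]. Both tiles are loaded so
 * that consecutive work-items read consecutive addresses.            */
#define TILE 16

__kernel void aat_coalesced(__global const float *A,
                            __global float *C,
                            const int N)
{
    __local float tileA[TILE][TILE];
    __local float tileB[TILE][TILE + 1];     /* +1 column of padding   */

    int row = get_global_id(1);
    int lx  = get_local_id(0), ly = get_local_id(1);
    int colBase = get_group_id(0) * TILE;
    float sum = 0.0f;

    for (int t = 0; t < N; t += TILE) {
        tileA[ly][lx] = A[row * N + t + lx];                /* coalesced */
        tileB[lx][ly] = A[(colBase + ly) * N + t + lx];     /* coalesced, stored transposed */
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < TILE; ++k)
            sum += tileA[ly][k] * tileB[k][lx];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * N + colBase + lx] = sum;
}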

Example 1.19


AMD-APP (using CAL): Write a simple HelloCAL application program using the Compute Abstraction Layer (CAL) of AMD Accelerated Parallel Processing Technology.

Example 1.20


AMD-APP: Write direct memory access (DMA) code to move data between system memory and GPU local memory using calMemCopy of AMD APP.


Example 1.21


AMD-APP: Write code that uses AMD-APP asynchronous operations through the CAL API, for an application that must perform CPU computations in the application thread while another kernel runs on the AMD-APP stream processor.

Example 1.22


AMD-APP (multiple devices): Write a matrix-vector multiplication based on a self-scheduling algorithm using the AMD-APP CAL API on multiple stream processors. Use calDeviceGetCount of the CAL API to enumerate the devices, and ensure that application-created CPU threads are used to manage the communication with the individual stream processors.






Centre for Development of Advanced Computing