List of Programs
Example 1.1
|
Write an OpenCL program to measure the time taken, for different data sizes, to copy (write)
data from pinned host memory to device memory using clEnqueueWriteBuffer()
|
Example 1.2
|
Measure the time for OpenCL memory transfers (blocking or non-blocking calls) and kernel
executions using either CPU timers or GPU timers (OpenCL events)
|
Example 1.3
|
Code to measure the effective bandwidth for a 1024x1024 matrix (single/double precision) using
the FastPath and the CompletePath, and measure the ratio to peak bandwidth
|
Example 1.4
|
Write a program to calculate memory throughput using the OpenCL Visual Profiler
|
Example 1.5
|
Analyze the differences between effective memory bandwidth and the memory throughput reported by the
OpenCL Visual Profiler
|
Example 1.6
|
Write a code to analyze the performance of highly data-parallel computations, such as matrix-matrix computations, on
GPUs in which each multiprocessor contains either 8,192 or 16,384 32-bit registers that are partitioned among
concurrent threads.
|
Example 1.7
|
OpenCL program to measure the highest bandwidth between host and device based on
page-locked (pinned) memory
|
Example 1.8
|
OpenCL program to measure the highest bandwidth between host and device based on page-locked
memory using blocking (synchronous) clEnqueueReadBuffer() / clEnqueueWriteBuffer() calls
|
Example 1.9
|
OpenCL program to measure the highest bandwidth between host and device based on page-locked memory,
using a non-blocking write (asynchronous transfer) via clEnqueueWriteBuffer() with parameter CL_FALSE
and a blocking read from device to host via clEnqueueReadBuffer() with parameter CL_TRUE
|
Example 1.10
|
OpenCL program to measure the highest bandwidth between host and device based on pinned memory,
using a non-blocking write (asynchronous transfer) via clEnqueueWriteBuffer() with parameter CL_FALSE
and a blocking read from device to host via clEnqueueReadBuffer() with parameter CL_TRUE
|
Example 1.11
|
OpenCL program on overlapping transfers and device computation, based on the SDK sample
"oclCopyComputeOverlap". The sample is devoted exclusively to an exposition of the techniques required to
achieve concurrent copy and device computation, and demonstrates the advantages of this relative to
purely synchronous operations.
|
Example 1.12
|
Write test suites focusing on performance and scalability analysis of different sizes of data
in the different memory spaces, i.e. the 16 KB per-thread limit on local memory
(OpenCL __private memory), 64 KB of constant memory (OpenCL __constant memory),
16 KB of shared memory (OpenCL __local memory),
and either 8,192 or 16,384 32-bit registers per multiprocessor.
|
Example 1.13
|
Write a code on how to use the default work-group size at compile time, the size of the work-group, the role of the
compiler in allocating the number of registers per work-item, and a sufficient number of wavefronts
|
Example 1.14
|
OpenCL program for matrix-matrix computation with different partitions of the matrices for coalescing global
memory accesses: (a) a simple access pattern, where the kth thread accesses the kth word in a segment (the exception
is that not all threads need to participate); (b) a sequential but misaligned access pattern (sequential threads in a half warp
access memory that is not aligned with the segments); (c) effects of misaligned accesses; (d) strided access
|
Example 1.15
|
Simple OpenCL program that computes a matrix-vector multiply, choosing the best-optimized NDRange so that
an optimized number of work-items is launched. Setting the right work-group size in
clEnqueueNDRangeKernel(), with the number of work-items a multiple of the warp size (i.e. 32), can be explored.
|
Example 1.16
|
Simple OpenCL program that computes the product W of a width x height matrix
M by a vector V, in which global
memory accesses are coalesced and the kernel is rewritten to have each work-group,
as opposed to each work-item, compute elements of W. (Each work-item is now responsible for
calculating part of the dot product of V and a row of M and storing it to OpenCL local memory.)
|
Example 1.17
|
OpenCL parallel reduction (a) with shared-memory bank conflicts caused by misaligned accesses, (b)
without shared-memory bank conflicts, and (c) with warp-based parallelism
|
Example 1.18
|
OpenCL code for the matrix-matrix multiply C = AA^T based on (a) strided accesses to global memory and (b)
shared-memory bank conflicts (with an optimized version using coalesced reads from global memory)
|
Example 1.19
|
AMD-APP (using CAL): Write a simple HelloCAL application program using CAL of the AMD Accelerated
Parallel Processing technology
|
Example 1.20
|
AMD-APP: Write a direct memory access (DMA) code to move data between the system memory and GPU local
memory using calMemCopy of AMD APP
|
Example 1.21
|
AMD-APP: Write a code that uses AMD-APP asynchronous operations through the CAL API, for an application that must
perform CPU computations in the application thread while also running another kernel on the AMD-APP stream processor
|
Example 1.22
|
AMD-APP (multiple devices): Write a matrix-vector multiplication based on a self-scheduling algorithm
using the AMD-APP CAL API, as an application that uses multiple stream processors via
calDeviceGetCount, and ensure
that application-created threads on the CPU are used to manage the communication with the
individual stream processors.
|