hyPACK-2013 GPGPU OpenCL Prog. using AMD-APP
AMD Accelerated Parallel Processing (AMD APP) software harnesses the tremendous processing power of GPUs for high-performance,
data-parallel computing in a wide range of applications. The AMD Accelerated Parallel Processing system
includes a software stack and the AMD GPUs.
The AMD APP SDK provides a complete heterogeneous
OpenCL development platform for both the CPU and GPU. The software includes:
the OpenCL compiler and runtime,
the AMD device driver for the GPU compute device,
performance profiling tools (AMD APP Profiler and AMD APP KernelAnalyzer), and
performance libraries (AMD Core Math Library, ACML).
The AMD APP OpenCL software development platform is available for x86-based CPUs and provides a complete heterogeneous
OpenCL development platform for both the CPU and GPU. Please refer to the AMD Accelerated Parallel Processing (AMD APP) OpenCL Programming Guide to
understand the relationship of the AMD Accelerated Parallel Processing components.
|
|
Introduction - OpenCL Standard 1.2
|
OpenCL is an industry-standard framework for programming computers composed of a combination of CPUs,
GPUs, and other processors. Such computing systems are called heterogeneous computing
platforms, and they address the problems of programming complex applications in the scientific,
engineering, and commercial domains. The OpenCL standard was first released in
December 2008 and is a relatively new technology.
OpenCL provides an opportunity for developers to effectively use the multiple heterogeneous
compute resources on their CPUs, GPUs, and other processors. OpenCL supports a wide range of
applications, ranging from embedded and consumer software to HPC solutions, through a low-level,
high-performance, portable abstraction. It is expected that OpenCL will form the foundation layer
of a parallel programming eco-system of platform-independent tools, middleware, and applications.
OpenCL is being created by the Khronos Group with the participation of many industry-leading
companies such as AMD, Apple, IBM, Intel, Imagination Technologies, Motorola, NVIDIA, and others.
OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of
modern processors found in personal computers, servers, and handheld/embedded devices.
OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide
spectrum of applications in discipline areas ranging from gaming and entertainment
to scientific and medical software.
OpenCL programs can run on a wide range of computing systems, from cell phones and laptops
with multi-core processors to nodes in massive supercomputers. OpenCL provides a uniform
programming environment for software developers to write efficient, portable code for
high-performance compute servers, desktop computer systems, and handheld devices using
a diverse mix of multi-core CPUs, GPUs, Cell-type architectures, and other parallel processors
such as DSPs. OpenCL is a framework for parallel programming and includes a language, an API,
libraries, and a runtime system to support software development. Using OpenCL, a programmer
can write general-purpose programs that execute on GPUs without the need to map their
algorithms onto a 3D graphics API such as OpenGL.
Most importantly, OpenCL delivers high levels of portability by exposing the hardware,
not by hiding it behind elegant abstractions. OpenCL programmers have the opportunity to
explicitly define the platform, its context, and how the work is scheduled onto the different
devices. The OpenCL framework is a good foundation and can be downloaded from
http://www.khronos.org/opencl/
|
Conceptual Foundations of OpenCL
|
Mapping the common programming models, such as data parallelism and
task parallelism, onto the real hardware of heterogeneous computing systems is a
challenging task: the computational elements in the system may
have different instruction sets and different memory architectures
and may run at different speeds. An effective program should map the
parallel software onto the most suitable OpenCL devices of the heterogeneous
platform. OpenCL programming exposes this heterogeneity rather than
raising the level of abstraction; high-level frameworks can simplify the
programming problem by mapping it onto high-level languages, which in turn
map to a low-level hardware abstraction layer for portability.
OpenCL supports a wide range of applications that run on heterogeneous
computing systems, and an OpenCL program carries out the following steps:
Discover the components that make up the heterogeneous system
Probe the characteristics of these components, so that the software can adapt to
the specific features of the different hardware elements
Create the blocks of instructions (kernels) that will run on the platform
These steps are accomplished through a series of APIs inside OpenCL, plus
a programming environment for the kernels.
The OpenCL programming model is divided into four models:
the Platform model, the Execution model, the Memory model, and the Programming model.
|
OpenCL - Platform Model
|
The OpenCL Platform model is a high-level description of the heterogeneous
system. The model specifies that there is one processor coordinating
execution (the host) and one or more processors capable of executing
OpenCL C code (the devices). It defines an abstract hardware model
that is used by programmers when writing OpenCL C functions
(called kernels) that execute on the devices, and it defines the
relationship between the host and the devices.
The host is connected to one or more OpenCL devices. The
device is where the stream of instructions (the kernel) executes;
an OpenCL device is therefore referred to as a compute device.
A device can be an x86 CPU, a GPU, a DSP, or any other processor
provided by the hardware and supported by the OpenCL vendor.
Platforms can be thought of as vendor-specific implementations
of the OpenCL API.
The platform model also presents an abstract device architecture
that programmers target when writing OpenCL C code. Vendors map this
abstract architecture to the physical hardware. The OpenCL
devices are further divided into compute units, which are in turn
divided into one or more processing elements (PEs). Each compute
unit is functionally independent from the rest. The platform device
model closely corresponds to the hardware model of some GPUs.
-
The Platform model consists of a host connected to one or more OpenCL devices.
-
An OpenCL device is divided into one or more compute units (CUs), which are further divided into one or
more processing elements (PEs). Computations on a device occur within the processing elements.
-
An OpenCL application runs on a host according to the models native to the host platform.
-
The OpenCL application submits commands from the host to execute computations on the processing
elements within a device.
-
The processing elements within a compute unit execute a single stream of instructions as SIMD units or as SPMD units.
For example, the AMD FireStream 9350 GPU card comprises 1440 stream cores
organized into 18 SIMD engines (compute units), with a peak of 2.0 TFLOPS
in single-precision floating point
(see the FireStream 9350 GPU data sheet).
The API function
clGetPlatformIDs() is used to discover the set of available
platforms for a given system.
|
cl_int
clGetPlatformIDs(cl_uint num_entries,
                 cl_platform_id *platforms,
                 cl_uint *num_platforms)
|
The API function
clGetPlatformIDs() is typically called twice: first with platforms set to
NULL to query the number of available platforms, and then again to fill in
an allocated array of platform IDs.
After the platforms have been discovered, the API function
clGetPlatformInfo() can be used to determine
which implementation (vendor) defined each platform.
The API function
clGetDeviceIDs() can be called to discover the devices of a platform.
It works similarly to
clGetPlatformIDs(), and it takes the additional arguments
of a platform and a device type.
The device_type argument can be used to limit the
devices to GPUs only (
CL_DEVICE_TYPE_GPU),
to CPUs only (
CL_DEVICE_TYPE_CPU),
or to all devices (
CL_DEVICE_TYPE_ALL), as well as other
options.
The API call
clGetDeviceInfo() is used to retrieve
information such as the name, type, and vendor of each device.
|
cl_int
clGetDeviceIDs(cl_platform_id platform,
               cl_device_type device_type,
               cl_uint num_entries,
               cl_device_id *devices,
               cl_uint *num_devices)
|
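Taken together, the calls above form the usual discovery idiom: call clGetPlatformIDs() once to obtain a count, allocate, call again to fill the array, then walk the platforms and query their devices. The following is a minimal host-side sketch of that idiom, under the assumption that an OpenCL implementation is installed; error checking is omitted for brevity.

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void)
{
    cl_uint num_platforms = 0;

    /* First call: platforms == NULL, just query the count. */
    clGetPlatformIDs(0, NULL, &num_platforms);

    cl_platform_id *platforms =
        malloc(num_platforms * sizeof(cl_platform_id));

    /* Second call: fill in the platform IDs. */
    clGetPlatformIDs(num_platforms, platforms, NULL);

    for (cl_uint i = 0; i < num_platforms; ++i) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);

        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL,
                       0, NULL, &num_devices);
        printf("Platform %u: %s (%u devices)\n", i, name, num_devices);
    }
    free(platforms);
    return 0;
}
```

Note that the two-call pattern (query count, then fill) recurs for clGetDeviceIDs() and most other OpenCL query functions.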
The
CLInfo program in the AMD APP SDK uses the
clGetPlatformInfo() and
clGetDeviceInfo() commands to print
detailed information about the OpenCL-supported
platforms and devices in a system.
A snippet of the output from the
CLInfo program is given below.
|
OpenCL CLInfo Program print-out :

Feature                   |  Description
Number of Platforms :     |  1
Platform Profile :        |  FULL_PROFILE
Platform Version :        |  OpenCL 1.1 AMD-APP SDK-v2.4
Platform Name :           |  AMD Accelerated Parallel Processing
Platform Vendor :         |  Advanced Micro Devices, Inc.
No. of devices :          |  2

Device Type :             |  CL_DEVICE_TYPE_GPU
Name :                    |  Cypress
Max compute units :       |  20
Address bits :            |  32
Max memory allocation :   |  268435456
Global memory size :      |  1073741824
Constant buffer size :    |  65536
Local memory type :       |  Scratchpad
Local memory size :       |  32768
Device endianness :       |  Little

Device Type :             |  CL_DEVICE_TYPE_CPU
Max compute units :       |  16
Name :                    |  AMD Phenom(tm) II X4 945 Processor
|
OpenCL - Execution Model
|
An OpenCL application consists of a host program
and a collection of one or more kernels. The host
program runs on the host. OpenCL defines
how the host program interacts with the objects defined within OpenCL.
The kernels execute on the OpenCL devices, and they do the
real work of the application.
-
Execution of an OpenCL program occurs in two parts: a host program that executes on the particular host
platform, and kernels that execute on one or more OpenCL devices.
-
The core of the OpenCL execution model is defined by how the kernels execute. A single instance of a kernel is called a
work-item, and these work-items are organized into work-groups.
-
Execution Model: Context and Command-Queues - The host defines a context for the execution of the kernels.
The context includes:
Devices : the collection of OpenCL devices to be used by the host;
Kernels : the OpenCL functions that run on OpenCL devices;
Program Objects : the program source and executables that implement the kernels;
Memory Objects : a set of memory objects visible to the host and the OpenCL devices; memory objects contain values that
can be operated on by instances of a kernel.
Kernels are typically simple
functions that transform input memory objects into output
memory objects. OpenCL defines two types of kernels.
OpenCL kernels : functions written in the OpenCL programming
language and compiled with the OpenCL compiler. An OpenCL
implementation must support OpenCL kernels.
Native kernels : functions created outside of OpenCL and
accessed within OpenCL through a function pointer. These could be
functions defined in the host source code or exported from a
specialized library.
The OpenCL execution model defines how the kernels execute, and
it can be explained in several parts.
First, we explain how an individual kernel runs on an OpenCL
device; then we describe how the host defines the context for
kernel execution and how the kernels are enqueued for execution.
How a kernel is executed on an OpenCL Device
A kernel is defined on the host. The host program issues a command
that submits the kernel for execution on an OpenCL device.
When this command is issued by the host, the OpenCL runtime system
creates an integer index space. An instance of the kernel
executes for each point in this index space. We call each instance
of an executing kernel a work-item,
which is identified
by its coordinates in the index space. These coordinates are the
global ID of the work-item.
The command that submits a kernel for execution, therefore, creates a
collection of work-items, each of which executes the same sequence of
instructions defined by the single kernel.
Each work-item selects its data through its global ID, and work-items
are organized into work-groups.
The work-groups are assigned a unique ID with the same dimensionality
as the index space used for the work-items. Work-items are assigned a unique
local ID within a work-group, so that a work-item can be uniquely
identified either by its global ID or by the combination of its local
ID and its work-group ID.
The work-items in a given work-group execute concurrently on the
processing elements of a single compute unit. Understanding this
concurrency (streams of operations performed in an independent fashion)
is important: an OpenCL implementation is free to serialize the
execution of kernels, and even to serialize the execution of work-groups within a
single kernel execution. OpenCL only assures that the work-items
within a work-group execute concurrently (and share processor resources
on the device). You can never assume that work-groups or kernel
invocations execute concurrently.
When a kernel is executed, the programmer specifies the number of
work-items that should be created as an n-dimensional range (NDRange).
Because the index space spans an N-dimensional range of values, it is
called an NDRange.
An NDRange is a one-, two-, or three-dimensional index space of
work-items that will often map to the dimensions of either the
input or the output data. Inside an OpenCL program, an NDRange is defined
by an integer array of length N specifying the size of the index space
in each dimension. Each work-item's global and local ID is an N-dimensional
tuple.
Work-groups are assigned IDs using a similar approach to that used for
work-items: an array of length N defines the number of work-groups in each
dimension.
OpenCL requires that the number of work-groups in each dimension evenly
divides the size of the NDRange index space in each dimension, so that
all work-groups are full and of the same size.
|
Table : Work-items and work-groups :
S.No.
|
Description
|
1.
|
work-item
The unit of concurrent execution in OpenCL C is a work-item.
(For example, in a typical fragment of code such as a data-parallel for loop in
multi-threaded code, a single iteration of the loop can be mapped to a
work-item.)
The OpenCL runtime generates as many work-items as there are elements in the
input and output arrays, and the runtime maps those work-items to
the underlying hardware (CPU or GPU).
(Conceptually, this is very similar to the parallelism inherent in a functional
map operation, or to a data-parallel for loop in a model such as
OpenMP.)
|
2.
|
Identification of a work-item
When an OpenCL device begins executing a kernel, it provides intrinsic
functions that allow a work-item to identify itself.
For example, calling
get_global_id(0) allows the programmer to obtain the position of the current
work-item, recovering the loop counter of the sequential version.
|
3.
|
work-groups (N-dimensional range,
NDRange)
Each work-item selects its data through its global ID, and work-items
are organized into work-groups. The work-groups are assigned a unique ID with
the same dimensionality as the index space used for the work-items.
Work-items are assigned a unique local ID within a work-group, so that
a work-item can be uniquely identified either by its global ID or by the
combination of its local ID and work-group ID.
|
4.
|
Execution in fine-grained work-items (N-dimensional range,
NDRange)
OpenCL describes execution in fine-grained work-items and
can dispatch a vast number of work-items on architectures
with hardware support for fine-grained threading. Scalability is achieved
through this support for large numbers of work-items.
When a kernel is executed, the programmer specifies the number of
work-items that should be created as an n-dimensional range
(NDRange).
An NDRange is a one-, two-, or three-dimensional index space of
work-items
that will often map to the dimensions of either the input or the
output data.
The dimensions of the NDRange are specified as an N-element array
of type
size_t, where N represents the number of dimensions used
to describe the work-items being created.
|
2D NDRange Example
|
Consider a 2D NDRange example. We use the lowercase letter “g” for the global ID of a work-item
in each dimension, given by a subscript x or y, and the uppercase letter “G” for the size
of the index space in each dimension.
Work-groups are assigned IDs using a similar approach to that used for
work-items: an array of length N defines the number of work-groups in each
dimension. Hence each work-item has a coordinate
(gx, gy)
in a global NDRange index space of size
(Gx, Gy),
where gx takes on the values
[0, 1, 2, ..., (Gx-1)] (and similarly for gy).
We divide the NDRange index space into work-groups.
OpenCL requires that the number of work-groups in each dimension evenly
divides the size of the NDRange index space in each dimension, so that all
work-groups are full and of the same size. We use a lowercase “w” for the work-group ID
and an uppercase “W” for the number of work-groups in each dimension.
The size in each direction (x and y in our 2D example) is used to define a local index
space for each work-item; we refer to this index space inside a work-group as
the local index space. The size of the local index space in each dimension
(x and y) is indicated with an uppercase
“L”, and the local ID inside a work-group
uses a lowercase “l”.
Assume that an NDRange index space of size Gx by Gy
is divided into
work-groups indexed over a Wx-by-Wy space with indices
(wx, wy). Each work-group is of size
Lx by Ly,
where:
Lx = Gx / Wx
Ly = Gy / Wy
We can refer to a work-item by its global ID
(gx, gy) or by the combination of the work-item's
local ID (lx, ly) within its work-group and the work-group ID
(wx, wy):
gx = wx * Lx + lx
gy = wy * Ly + ly
Conversely, the local ID and work-group ID can be derived as follows:
wx = gx / Lx ;
wy = gy / Ly
lx = gx % Lx ;
ly = gy % Ly
In the above equations, integer division
(division with truncation) and
the modulus or “integer remainder” operation (%) are assumed.
It is also assumed that the index space starts at zero in each
dimension. OpenCL provides an option to define an offset
for the starting point of the global index space. The offset is
defined for each direction (x, y), and we use a lowercase
“o” for the offset. With the inclusion of a non-zero
offset (ox, oy), the final
equations are:
gx = wx * Lx + lx + ox
gy = wy * Ly + ly + oy
The OpenCL execution model is quite flexible and supports
a wide range of programming models.
|
Execution Model : Context
|
In OpenCL, the host defines the kernels, and the host establishes the
context for the kernels. The host defines the NDRange and the queues
that control the details of how and when the kernels execute. The
first task of the host is to define the context for the OpenCL
application.
The context defines the environment within which the kernels are
defined and execute.
The context is an abstract container that exists on the host.
A context
coordinates the mechanisms for host-device interaction,
manages the memory objects that are available to the devices, and
keeps track of the programs and kernels that are created for each device.
A context is defined in terms of the following resources:
-
Devices : the collection of OpenCL devices to be used by the host
-
Kernels : the OpenCL functions that run on the
OpenCL device.
-
Program Objects : the program source code and executable that
implement the kernels
-
Memory Objects : a set of objects in memory
that are visible to the OpenCL devices and contain values
that can be operated on by instances of a kernel.
|
The context is created and manipulated by the host using functions
from the OpenCL API, such as clCreateContext().
The properties argument of clCreateContext() is used to restrict the scope of the context:
it may specify a particular platform, enable graphics interoperability,
or enable other parameters in the future.
The programmer is required to create a context for each platform in order to fully
utilize a system comprising resources from a mixture of vendors, and
the number and IDs of the devices that the programmer wants to associate with the context
must be supplied. OpenCL also allows user callbacks to be provided when creating
a context, which can be used to report additional error information as it is
generated.
In OpenCL, the process of discovering platforms and devices and setting up a context is tedious.
However, after the code to perform these steps is written once, it can be reused for almost
any project.
|
OpenCL Context API
|
cl_context
clCreateContext(const cl_context_properties *properties,
                cl_uint num_devices,
                const cl_device_id *devices,
                void (CL_CALLBACK *pfn_notify)(
                    const char *errinfo,
                    const void *private_info,
                    size_t cb,
                    void *user_data),
                void *user_data,
                cl_int *errcode_ret)
|
The OpenCL specification also provides an API call that alleviates the need to build a list
of devices:
clCreateContextFromType()
allows a programmer to create a context that automatically
includes all devices of the specified type (e.g., CPUs, GPUs, or all devices).
After creating a context, the function
clGetContextInfo()
can be used to query
information such as the number of devices present and the device structures.
Once written, the code for discovering platforms and devices and setting up a
context can be reused for many applications.
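As a sketch, the context setup described above might look as follows on the host. This assumes a single platform with at least one GPU device and reduces error handling to one check; it illustrates the API sequence, not production code.

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* The properties list restricts the context to the chosen platform;
       the list is terminated by a 0 entry. */
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0
    };

    cl_int err;
    /* No user callback (pfn_notify == NULL) in this sketch. */
    cl_context context = clCreateContext(props, 1, &device,
                                         NULL, NULL, &err);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clCreateContext failed (%d)\n", err);
        return 1;
    }

    clReleaseContext(context);
    return 0;
}
```

Using clCreateContextFromType(CL_DEVICE_TYPE_GPU, ...) would collapse the device-selection steps into the context-creation call itself.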
|
Execution Model : Command-Queues
|
The host program issues commands to the OpenCL devices: the host posts the commands,
which are the interaction between the host and the OpenCL devices, to a
command-queue. These commands wait in the command-queue until they
execute on the OpenCL device.
Communication with a device thus occurs by the host submitting commands to a command-queue.
The command-queue is the mechanism that the host uses to request action by a device.
A command-queue is created by the host and attached to a single OpenCL device after the
context has been defined. Each command-queue is associated with only one device.
Whenever the host needs an action to be performed by a device, it submits
commands
to the proper command-queue.
The API function clCreateCommandQueue() is used
to create a command-queue
and associate it with a device.
|
OpenCL command-queue
|
cl_command_queue
clCreateCommandQueue(cl_context context,
                     cl_device_id device,
                     cl_command_queue_properties properties,
                     cl_int *errcode_ret)
|
The host places commands into the command-queue, and the commands
are then scheduled for execution on the
associated device. OpenCL supports three types of commands :
-
Kernel Execution commands : executes a kernel on the processing
elements of an OpenCL device
-
Memory commands : transfer data between the host and different
memory objects, move data between memory objects, or map and unmap
memory objects from the host address space.
-
Synchronization commands : put constraints on the order
in which commands execute.
|
In the host program, it is the responsibility of the programmer to define the
context and command-queues, define the memory and program objects, and
build the data structures needed on the host to support the application.
Before the program issues commands to the OpenCL devices, memory objects
are moved from the host onto the devices. The kernel arguments are attached to
memory objects, and the kernel is then submitted to the command-queue for execution on
the device.
When multiple kernels are submitted to the queue, they may interact: memory
objects generated by one set of kernels may be manipulated by another set. In such
situations, synchronization commands can be used to force one set of kernels
to complete before the others begin.
OpenCL uses an in-order command-queue by default:
commands are launched in the order in which they appear
in the command-queue and complete in order.
If out-of-order queues are used, it is up to the user to specify
dependencies that enforce a correct execution order.
|
Command-Queues - Event Objects
To support synchronization protocols, commands submitted to a command-queue generate
event objects. Any operation that enqueues a command into a command-queue - that is,
any API call whose name begins with clEnqueue - generates an event. Events represent dependencies
and provide a mechanism for profiling. API calls that begin with
clEnqueue also take a “waitlist” of events as a parameter:
the enqueued command will not execute until all the events in its waitlist have
completed.
|
Command-Queues - Flush and Finish
The flush and finish commands are two different types of barrier operations
for a command-queue. The clFinish() function blocks
until all of the commands in a command-queue have completed.
The clFlush() function blocks until all of the commands
in a command-queue have been removed from the queue (issued to the device);
it does not wait for them to complete.
|
Memory Model
|
The OpenCL memory model covers memory objects, which are associated with a command-queue and
are used in the interaction between the host and the kernels.
OpenCL defines two types of memory objects: buffer objects and image objects.
Buffers are equivalent to arrays in C created using malloc(): data
elements are stored contiguously in memory. A programmer can map data structures
onto a buffer and access the buffer through pointers.
Whenever a memory object is created, it is valid only within a single context.
Movement to and from specific devices is managed by the OpenCL runtime as
necessary to satisfy data dependencies.
The API function clCreateBuffer() allocates a buffer and
returns a memory object.
Creating a buffer requires supplying the size of the buffer and a context in which
the buffer will be allocated; the buffer is visible to all devices associated with that
context.
|
cl_mem clCreateBuffer(
cl_context context,
cl_mem_flags flags,
size_t size,
void *host_ptr,
cl_int *errcode_ret)
|
The OpenCL API calls for reading and writing buffers are similar;
the write-buffer call is described below.
|
cl_int
clEnqueueWriteBuffer(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_write,
size_t offset,
size_t cb,
const void *ptr,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event)
|
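A hedged sketch of the buffer flow described above: create the buffer in a context, then use a blocking clEnqueueWriteBuffer() on a command-queue to copy host data into it. The function name copy_to_device and the 1024-element size are illustrative choices, and a valid context and queue (from the earlier discovery and context steps) are assumed to exist.

```c
#include <string.h>
#include <CL/cl.h>

/* Assumes a valid context and command-queue already exist. */
void copy_to_device(cl_context context, cl_command_queue queue)
{
    float host_data[1024];
    memset(host_data, 0, sizeof(host_data));

    cl_int err;
    /* Allocate a device buffer the same size as the host array. */
    cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                   sizeof(host_data), NULL, &err);

    /* Blocking write (CL_TRUE): the call returns only after
       host_data has been copied and is safe to reuse. */
    clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0,
                         sizeof(host_data), host_data,
                         0, NULL, NULL);

    clReleaseMemObject(buffer);
}
```

Passing CL_FALSE instead would make the write non-blocking, in which case the returned event (last parameter) should be used to order later commands.
|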
Image Objects
|
Images are a type of OpenCL memory object that abstracts the storage of physical data
to allow for device-specific optimizations. The OpenCL framework provides
functions to manipulate images. Images are an example of the OpenCL standard
being dependent on the underlying hardware of a particular device.
The API function clCreateImage2D() or
clCreateImage3D() is used to create an OpenCL image.
|
cl_mem
clCreateImage2D(
cl_context context,
cl_mem_flags flags,
const cl_image_format *image_format,
size_t image_width,
size_t image_height,
size_t image_row_pitch,
void *host_ptr,
cl_int *errcode_ret)
|
The OpenCL memory model defines five distinct memory regions
in addition to memory objects. These memory spaces are relevant
within OpenCL programs; the keywords associated with each space
can be used to specify where a variable should be created or
where the data that it points to resides.
Host memory : the host manages this memory, and OpenCL
defines how the host memory interacts with OpenCL objects
and constructs.
Global memory : is visible to all compute units on the
device (similar to the main memory of a CPU-based host system). Whenever
data is transferred from the host to the device, the data
resides in global memory, and any data that is to be transferred back from the device
to the host must also reside in global memory. This memory region permits
read/write access to all work-items in all work-groups: work-items can read
from or write to any element of a memory object in global memory.
Reads and writes to global memory may be cached, depending on the
capabilities of the device. The keyword
__global is added to a pointer declaration to specify
that the data referenced by the pointer resides in global memory.
Constant memory : is a read-only region of global memory that can be
accessed simultaneously by all work-items; it remains
constant during the execution of a kernel. The host allocates and
initializes the memory objects placed into constant memory.
Constant memory is modeled as part of global memory, so memory objects
that are transferred to global memory can be specified as constant.
Data is mapped to constant memory by using the
__constant keyword.
Local Memory : is a “scratchpad” memory
whose address space is unique to each compute unit; it may be
implemented as a dedicated region of memory on the OpenCL device. Local
memory is modeled as being shared by all work-items in a work-group.
Calling
clSetKernelArg() with a size but a NULL argument value allows local
memory to be allocated at runtime, where the kernel parameter is declared as
a __local pointer (e.g.,
__local float *sharedData). Alternatively, arrays can be
statically declared in local memory by prepending the keyword
__local, in which case it is necessary to specify the
array size at compile time.
Private Memory : is memory unique to
an individual work-item. Variables defined in a work-item's private
memory are not visible to other work-items. Local variables and
non-pointer kernel arguments are private by default.
In OpenCL, the work-items run on processing elements (PEs) and have
their own private memory. A work-group runs on a compute unit and shares
a local memory region among the work-items in the group. The OpenCL
device memory works with the host to support global memory. The interaction
between host and device occurs either by explicitly copying data or by mapping
and unmapping regions of a memory object. OpenCL defines a relaxed
consistency model between the host and the OpenCL device:
the consistency of memory objects is defined relative to the commands on the
command-queue, and memory is guaranteed to be consistent only at synchronization points
on the command-queue.
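The four device-side regions can be seen together in a single kernel. The following OpenCL C kernel (device code, not host C) is a hypothetical example: the kernel name scale_sum and the reduction it performs are invented for illustration. It uses __global, __constant, and __local qualified arguments alongside private variables, with the __local scratch buffer sized from the host via clSetKernelArg().

```c
/* OpenCL C device code: one scaled partial sum per work-group. */
__kernel void scale_sum(__global const float *in,    /* global memory   */
                        __global float *out,
                        __constant float *factor,    /* constant memory */
                        __local float *scratch)      /* local memory, sized
                                                        via clSetKernelArg */
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    /* 'v' lives in the work-item's private memory. */
    float v = in[gid] * factor[0];

    scratch[lid] = v;
    barrier(CLK_LOCAL_MEM_FENCE);    /* work-group-level synchronization */

    if (lid == 0) {
        float sum = 0.0f;
        for (int i = 0; i < get_local_size(0); ++i)
            sum += scratch[i];
        out[get_group_id(0)] = sum;  /* one result per work-group */
    }
}
```

On the host, the __local argument would be set with clSetKernelArg(kernel, 3, local_size * sizeof(float), NULL), passing a size but no data.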
|
The table below describes, for each memory region, whether the kernel or the host can allocate in it,
the type of allocation (static, i.e. at compile time, versus
dynamic, i.e. at runtime), and the type of access allowed, i.e. whether the kernel or the host can read and/or write the
memory region.
|
|
         | Global             | Constant           | Local              | Private            |
Host     | Dynamic allocation | Dynamic allocation | Dynamic allocation | No allocation      |
         | Read/Write access  | Read/Write access  | No access          | No access          |
Kernel   | No allocation      | Static allocation  | Static allocation  | Static allocation  |
         | Read/Write access  | Read-only access   | Read/Write access  | Read/Write access  |
|
-
The application running on the host uses the OpenCL API to create memory objects in global
memory, and to enqueue memory commands (Refer OpenCL API specification) that operate on those
memory objects.
-
OpenCL uses a relaxed consistency memory model; i.e., the state of memory visible to a work-item is not guaranteed to be
consistent across the collection of work-items at all times.
|
Programming Model
|
The OpenCL programming model defines two different programming models: data parallelism
and task parallelism. In the data-parallel model, the algorithm designer aligns the data structures
of the problem with the NDRange index space, which is defined when the kernel is launched.
These are mapped onto OpenCL memory objects. The kernel defines the sequence of instructions to be
applied concurrently by the work-items in an OpenCL computation.
Work-items in a single work-group can share data, and work-items in a single work-group can
participate in a work-group barrier. OpenCL 1.1 does not provide any mechanism for synchronization
between work-items from different work-groups while a kernel is executing.
The OpenCL specification defines two variants of data parallelism: explicit and implicit.
In the explicit model, the programmer takes responsibility for explicitly defining the
sizes of the work-groups. In the implicit model, the programmer defines only the NDRange space and
leaves it to the system to choose the work-group sizes. The programmer's responsibility in the
data-parallel model is to avoid branch statements in the computation and to ensure that each
work-item executes identical operations, but on a subset of data items selected by its global ID.
Equivalently, the computations are executed in SIMD or SPMD style; OpenCL supports both the SIMD
and SPMD models of data-parallel computation.
In the task-parallel programming model, OpenCL defines a task as a kernel that executes as a single
work-item, regardless of the NDRange used by other kernels in the OpenCL application. Another form
of task parallelism arises when kernels are submitted to an out-of-order queue, in which case
scheduling the work across compute units is done by the runtime. The OpenCL event model can also
express task parallelism, when the commands submitted to a queue are linked by events; the OpenCL
programmer can define static task graphs to carry out task-parallel computations.
The OpenCL programmer is free to combine OpenCL's programming models with other programming
models such as MPI, Pthreads, and Intel TBB in hybrid applications.
|
Creating an OpenCL Program Object
|
The OpenCL C language code which runs on an OpenCL device is called a program.
A program is
a collection of functions called kernels, where kernels are units of execution that
can be scheduled to run on a device. OpenCL programs are compiled at runtime through a
series of API calls; runtime compilation gives the system an opportunity to optimize
for a specific device. OpenCL software links only to a common runtime layer (called the ICD);
all platform-specific SDK activity is performed at runtime through a dynamic library interface.
The process of creating a kernel is as follows.
The OpenCL C source code is stored in a character string.
The source code is turned into a program object,
cl_program, by calling
clCreateProgramWithSource().
The program object is then compiled, for one or more OpenCL
devices, with
clBuildProgram(). If there are compile errors,
they will be reported here.
|
The precise binary representation used is vendor-specific. OpenCL
provides a function to return information about program objects,
clGetProgramInfo(). OpenCL also provides
clCreateProgramWithBinary(), which takes
a list of binaries that matches the device list.
|
The OpenCL Kernel Execution
|
First Step : Obtain a cl_kernel Object :
To execute a kernel on a device, a cl_kernel object is needed; it is
extracted from the cl_program. The name of a kernel that the program exports
is used to request it from the compiled program object. The kernel name is
passed to clCreateKernel(), along with the program object, and the kernel
object is returned if the program object is valid and the particular kernel is found.
Second Step : Dispatch the Kernel through an enqueue function :
Kernel arguments, and information about how the arguments are to be transferred
to the device, must be set before the kernel is dispatched through an enqueue function.
Because kernel arguments are persistent, we must specify each kernel
argument individually using the function
clSetKernelArg().
This function takes a kernel object, an index specifying the argument number, the size of the argument, and a pointer to the argument. When a kernel is executed, this information is
used to transfer the arguments to the device.
After any required memory objects are transferred to the device and the kernel arguments
are set, the kernel is ready to be executed. Requesting that a device begin executing a
kernel is done with a call to clEnqueueNDRangeKernel()
|
cl_int
clEnqueueNDRangeKernel(
cl_command_queue command_queue,
cl_kernel kernel,
cl_uint work_dim,
const size_t *global_work_offset,
const size_t *global_work_size,
const size_t *local_work_size,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event)
|
The clEnqueueNDRangeKernel() call is asynchronous
and returns immediately; the events in event_wait_list must have completed
before the kernel begins execution.
|
The OpenCL Platform Layer APIs
-
The OpenCL platform layer implements platform-specific features that allow applications to query OpenCL device
configuration information and to create OpenCL contexts using one or more devices.
-
Querying Platform Info
The function
cl_int
clGetPlatformInfo
(
cl_platform_id platform,
cl_platform_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret
)
gets specific information about the OpenCL platform.
-
Querying Devices
The list of available devices can be obtained using the following function
cl_int
clGetDeviceIDs
(
cl_platform_id platform,
cl_device_type device_type,
cl_uint num_entries,
cl_device_id *devices,
cl_uint *num_devices
)
which gets the list of devices available on a platform.
-
Contexts
The function
cl_context
clCreateContext
(
const cl_context_properties *properties,
cl_uint num_devices,
const cl_device_id *devices,
void (*pfn_notify)
(const char *errinfo,
const void *private_info,
size_t cb,
void *user_data),
void *user_data,
cl_int *errcode_ret
)
creates an OpenCL context. An OpenCL context is created with one or more devices.
Contexts are used by the OpenCL runtime for managing objects such as
command-queues, memory, program and kernel objects and for executing kernels on one
or more devices specified in the context.
-
Contexts
The function
cl_context
clCreateContextFromType
(
const cl_context_properties *properties,
cl_device_type device_type,
void (*pfn_notify)
(const char *errinfo,
const void *private_info,
size_t cb,
void *user_data),
void *user_data,
cl_int *errcode_ret
)
creates an OpenCL context from a device type that identifies the specific device(s) to use.
properties specifies a list of context property names and their corresponding values. Each
property name is immediately followed by the corresponding desired value.
|
The OpenCL Runtime APIs
-
OpenCL provides API calls that manage OpenCL objects such as command-queues, memory objects, program objects, and
kernel objects for __kernel
functions in a program, as well as calls that allow you to enqueue commands to a command-queue, such as executing a kernel or
reading and writing a memory object.
-
Command Queues
The function
cl_command_queue
clCreateCommandQueue
(
cl_context context,
cl_device_id device,
cl_command_queue_properties properties,
cl_int *errcode_ret
)
creates a command-queue on a specific device.
-
Memory Objects
Memory objects are categorized into two types : buffer and
image objects. A buffer object stores a one-dimensional
collection of elements whereas an image object is used to store
a two- or three- dimensional texture, frame-buffer or image.
A buffer object is created using the following function
cl_mem
clCreateBuffer
(
cl_context context,
cl_mem_flags flags,
size_t size,
void *host_ptr,
cl_int *errcode_ret
)
context is a valid OpenCL context used to create the buffer object.
-
Reading, Writing and Copying Buffer Objects
The following functions enqueue commands to read from a buffer
object to host memory or write to a buffer object from
host memory.
cl_int
clEnqueueReadBuffer
(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_read,
size_t offset,
size_t cb,
void *ptr,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event
)
cl_int
clEnqueueWriteBuffer
(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_write,
size_t offset,
size_t cb,
const void *ptr,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event
)
cl_int
clEnqueueCopyBuffer
(
cl_command_queue command_queue,
cl_mem src_buffer,
cl_mem dst_buffer,
size_t src_offset,
size_t dst_offset,
size_t cb,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event
)
-
Retaining and Releasing Memory Objects
The function
cl_int
clRetainMemObject
(
cl_mem memobj
)
increments the memobj reference count.
clRetainMemObject returns
CL_SUCCESS
if the function is executed successfully. It returns
CL_INVALID_MEM_OBJECT
if memobj is not a valid memory object.
-
Reading, Writing and Copying Image Objects
The following functions enqueue commands to read from a 2D or 3D image object to host
memory, or write to a 2D or 3D image object from host memory.
cl_int
clEnqueueReadImage
(
cl_command_queue command_queue,
cl_mem image,
cl_bool blocking_read,
const size_t origin[3],
const size_t region[3],
size_t row_pitch,
size_t slice_pitch,
void *ptr,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event
)
cl_int
clEnqueueWriteImage
(
cl_command_queue command_queue,
cl_mem image,
cl_bool blocking_write,
const size_t origin[3],
const size_t region[3],
size_t input_row_pitch,
size_t input_slice_pitch,
const void *ptr,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event
)
The function
cl_int
clEnqueueCopyImage
(
cl_command_queue command_queue,
cl_mem src_image,
cl_mem dst_image,
const size_t src_origin[3],
const size_t dst_origin[3],
const size_t region[3],
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event
)
enqueues a command to copy image objects. src_image and dst_image
can be 2D or 3D image
objects, allowing us to perform the following actions.
-
Copy a 2D image object to a 2D image object
-
Copy a 2D image object to a 2D slice of a 3D image Object
-
Copy a 2D slice of a 3D image object to a 2D image object
-
Copy a 3D image object to a 3D image object.
-
Mapping and Unmapping Memory Objects
The function
void *
clEnqueueMapBuffer
(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_map,
cl_map_flags map_flags,
size_t offset,
size_t cb,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event,
cl_int *errcode_ret
)
enqueues a command to map a region of the buffer object given by buffer into the host address space and returns
a pointer to this mapped region.
The function
void *
clEnqueueMapImage
(
cl_command_queue command_queue,
cl_mem image,
cl_bool blocking_map,
cl_map_flags map_flags,
const size_t origin[3],
const size_t region[3],
size_t *image_row_pitch,
size_t *image_slice_pitch,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event,
cl_int *errcode_ret
)
enqueues a command to map a region of the image object given by image into the host address
space and returns a pointer to this mapped region.
-
Memory Objects Queries
To get information that is common to all memory objects (buffer and image objects), use the following function
cl_int
clGetMemObjectInfo
(
cl_mem memobj,
cl_mem_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret
)
memobj specifies the memory object being queried.
The function
cl_int
clGetImageInfo
(
cl_mem image,
cl_image_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret
)
gets information specific to an image object created with
clCreateImage{2D|3D}.
-
Sampler Objects
A sampler object describes how to sample an image when the image is read in the kernel.
The following OpenCL functions create and query sampler objects.
cl_sampler
clCreateSampler
(
cl_context context,
cl_bool normalized_coords,
cl_addressing_mode addressing_mode,
cl_filter_mode filter_mode,
cl_int *errcode_ret
)
The above function creates a sampler object.
cl_int
clGetSamplerInfo
(
cl_sampler sampler,
cl_sampler_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret
)
The above function returns information about the sampler object.
-
Program Objects
An OpenCL program consists of a set of kernels that are identified as functions declared with
the __kernel qualifier in the
program source. OpenCL programs may also contain auxiliary functions and constant data that
can be used by __kernel functions.
Creating Program Objects
The function
cl_program
clCreateProgramWithSource
(
cl_context context,
const char **strings,
const size_t *lengths,
cl_int *errcode_ret
)
The above function creates a program object for a context, and loads the source code specified by the text
strings in the strings array into the program object. The devices associated with the program object are
the devices associated with the context.
cl_program
clCreateProgramWithBinary
(
cl_context context,
cl_uint num_devices,
const cl_device_id *device_list,
const size_t *lengths,
const unsigned char **binaries,
cl_int *binary_status,
cl_int *errcode_ret
)
The above function creates a program object for a context, and loads the binary bits specified by
binary into the program object.
Building Program Executables
The function
cl_int
clBuildProgram
(
cl_program program,
cl_uint num_devices,
const cl_device_id *device_list,
const char *options,
void (*pfn_notify)
(cl_program program,
void *user_data),
void *user_data
)
builds (compiles & links) a program executable from the program source or binary for all
the devices or a specific device(s) in the OpenCL context associated with program.
Building Options
The build options are categorized as pre-processor options, options for math intrinsics, options that control
optimization and miscellaneous options.
-
Kernel Objects
A kernel is a function declared in a program. A kernel is identified by the
__kernel qualifier applied to
any function in a program. A kernel object encapsulates the specific
__kernel function
declared in a program and the argument values to be used when executing this
__kernel function.
Creating Kernel Objects
To create a kernel object, use the function
cl_kernel
clCreateKernel
(
cl_program program,
const char *kernel_name,
cl_int *errcode_ret
)
To create kernel objects for all kernel functions in the program, use the function
cl_int
clCreateKernelsInProgram
(
cl_program program,
cl_uint num_kernels,
cl_kernel *kernels,
cl_uint *num_kernels_ret
)
-
Kernel Objects
Setting Kernel Arguments
To execute a kernel, the kernel arguments must be set. The function
cl_int
clSetKernelArg
(
cl_kernel kernel,
cl_uint arg_index,
size_t arg_size,
const void *arg_value
)
is used to set the argument value for a specific argument of a kernel.
-
Executing Kernels
The function
cl_int
clEnqueueNDRangeKernel
(
cl_command_queue command_queue ,
cl_kernel kernel,
cl_uint work_dim ,
const size_t *global_work_offset,
const size_t *global_work_size,
const size_t *local_work_size,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event
)
enqueues a command to execute a kernel on a device.
The function
cl_int
clEnqueueTask
(
cl_command_queue command_queue ,
cl_kernel kernel,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event
)
enqueues a command to execute a kernel as a single work-item.
The function
cl_int
clEnqueueNativeKernel
(
cl_command_queue command_queue ,
void (*user_func)(void *),
void *args,
size_t cb_args,
cl_uint num_mem_objects,
const cl_mem *mem_list,
const void **args_mem_loc,
cl_uint num_events_in_wait_list ,
const cl_event *event_wait_list,
cl_event *event
)
enqueues a command to execute a native C/C++ function not compiled using the OpenCL Compiler.
Event objects can be used to refer to a kernel execution command (
clEnqueueNDRangeKernel,
clEnqueueTask,
clEnqueueNativeKernel)
or to read, write, map and copy commands on memory objects (
clEnqueue{Read|Write|Map}{Buffer|Image},
clEnqueueCopy{Buffer|Image}, clEnqueueCopyBufferToImage, or
clEnqueueCopyImageToBuffer).
An event object can be used to track the execution of a command.
The function
cl_int
clWaitForEvents
(
cl_uint num_events,
const cl_event *event_list
)
waits on the host thread for commands identified by event objects in
event_list to complete.
The function
cl_int
clGetEventInfo
(
cl_event event,
cl_event_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret
)
returns information about the event object.
|
An Overview of Basic Programming Steps :
The steps given below illustrate the basic programming required with a minimum amount of code.
Many test programs require similar steps; these steps do not include error checks.
1.
|
Query OpenCL Platform : The host program must select an OpenCL platform, which is an
abstraction for a given OpenCL implementation. Developers can use the
clGetPlatformIDs(..) API
to get the available OpenCL platforms.
|
2.
|
Query OpenCL Device : A device ID for an OpenCL device is requested. Developers can use the
clGetDeviceIDs(..) API to
find an OpenCL device (for example, with CL_DEVICE_TYPE_GPU);
a CPU device could be requested by using CL_DEVICE_TYPE_CPU instead.
The device can be a physical device, such as a given GPU.
|
3.
|
Create Context : On the selected device, an OpenCL context is created. Developers can
use the clCreateContext(..)
or relevant API to create a context. A context ties together a device and the memory buffers related to
that device.
|
4.
|
Create Command Queue : After creating the OpenCL context, create a command queue. A command queue can
be created using the clCreateCommandQueue(..)
API. The command queue issues commands to a specific compute device, and ensures that a set of operations
occurs in a particular order.
|
5.
|
Create Program Object : Before an OpenCL kernel can be launched, its program source is
compiled, and a handle to the kernel is created. A program object can be created directly from the source code
of an OpenCL program and compiled at application runtime. Alternatively, a program object can be created
using a previously built binary to avoid compilation at runtime. To build a program object from source,
use clCreateProgramWithSource(..),
and to build a program object from a binary, use
clCreateProgramWithBinary(..).
|
6.
|
Build Program Executable : After creating a program object using
either clCreateProgramWithSource(..)
or clCreateProgramWithBinary(..),
the developer must build a program executable from the contents of that program object.
Building the program compiles any source code that is in the program object and links
the resultant machine code into an executable program. Use the
clBuildProgram(..) API for that.
|
7.
|
Create Kernel Object : Kernel object encapsulates the specific kernel function declared
in a program and can also encapsulate the parameter values to use when executing this kernel.
Developer can use clCreateKernel(..)
API to create a single kernel object or call the
clCreateKernelsInProgram(..)
API to create kernel objects for all the kernels in the OpenCL program.
|
8.
|
Create Memory Object : Memory objects are reserved regions of global device memory that can
serve as containers for your data. Memory buffers can be allocated on the device as per program requirements.
Developers can use the clCreateBuffer(..)
API to create a data buffer.
|
9.
|
Launch / Execute kernel : Developers can use the
clEnqueueNDRangeKernel(..)
API to enqueue a command to execute a kernel on a device.
|
10.
|
Retrieving the Results : After the kernel has completed execution, read the data from the
device back to the host, where it is accessible to the host application. To read the data,
call clEnqueueReadBuffer(...)
or other related APIs.
|
11.
|
Clean up : After the host application no longer requires the various objects associated with
the OpenCL runtime and context, it should free these resources. Developers can use the following APIs
to release OpenCL objects:
clReleaseMemObject(...), clReleaseKernel(...),
clReleaseProgram(...),
clReleaseCommandQueue(...),
clReleaseContext(...),
or other related APIs.
|
|
Build application - Steps
-
First, query the runtime to determine which platforms are present. Any number of
different OpenCL implementations may be installed.
-
Create a context (The OpenCL Context has associated with it a number of compute devices
such as CPU or GPU devices)
Within a context, OpenCL guarantees a relaxed consistency between these devices.
This means that memory objects, such as buffers or images, are allocated per context, but
changes made by one device are only
guaranteed to be visible by another device at well-defined synchronization points.
-
OpenCL provides events, with the ability to synchronize on a given event, to enforce the
correct order of execution.
-
While many operations are performed with respect to a given context, many operations
are specific to a device.
For example, program compilation and kernel execution are done on a per-device basis.
-
Performing work with a device, such as executing
kernels or moving data to and from the device's local memory, is done using a corresponding command queue.
-
A command queue is associated with a single device and a given context: all work for a specific
device is done through this
interface. Note that while a single command queue can be associated with only a single device,
multiple command queues can target the same device. For example, it is possible to have one
command queue for executing kernels and another command queue for managing data transfers between the
host and the device.
Most OpenCL programs follow the same pattern. Given a specific platform, select a device or devices,
create a context, allocate memory,
create device-specific command queues, and perform data
transfers and computations.
Generally, the platform is the gateway to accessing specific devices; given these devices and a corresponding
context, the application is independent of the platform. Given a context, the application can:
-
Create command queues
-
Create programs to run on one or more associated devices
-
Create kernels within those programs
-
Allocate memory buffers or images, either on the host or on the device(s) (memory can be copied between
the host and device)
-
Write data to the device
-
Submit the kernel (with appropriate arguments) to the command queue for execution.
-
Read data back to the host from the device.
The relationship between context(s), device(s), buffer(s), program(s), kernel(s), and command queue(s) is
best seen by looking at
simple code.
|
The OpenCL Framework
-
The OpenCL Framework allows applications to use host and one or more OpenCL devices as a single heterogeneous
parallel computer system. The framework contains the components OpenCL Platform layer,
OpenCL Runtime, and
OpenCL Compiler.
The OpenCL Platform Layer
The OpenCL Runtime
-
Command Queues
-
Memory Objects
- Creating Buffer Objects,
- Reading, Writing and copying Buffer Objects
- Retaining and Releasing Memory Objects
- Creating Image Objects,
- Querying List of Supported Image formats,
- Reading, Writing and Copying Image objects
- Copying between Image and Buffer Objects
- Mapping and Unmapping Memory Objects
- Memory Object Queries
-
Sampler Objects
-
Program Objects
- Creating Program Objects
- Building Program Executables
-
Build Options
- Options (Preprocessor,
Math Intrinsic,
Optimization)
- Unloading the OpenCL compiler
- Program Object Queries
-
Kernel Objects
- Creating Kernel Objects
- Setting Kernel Arguments
- Kernel Object Queries
-
Executing Kernels
-
Event Objects
-
Profiling Operations on Memory Objects and Kernels
-
Flush and Finish
The OpenCL Compilation
|
The OpenCL Compilation :
The compiler tool-chain provides a common framework for both CPUs and GPUs, sharing the front-end and some high-level
compiler transformations. The back-ends are optimized for the device type (CPU or GPU). Most of the application remains the
same, but OpenCL APIs are included at various parts of the code. The kernels are compiled by the OpenCL compiler to
either CPU binaries or GPU binaries, depending on the target device.
-
CPU Processing : For CPU processing, the OpenCL runtime uses LLVM (Low-Level Virtual Machine)
to generate x86 binaries. The OpenCL runtime automatically determines the number of processing elements, or cores, present in the
CPU and distributes the OpenCL kernel between them.
-
GPU Processing : For GPU processing, the OpenCL runtime layer generates GPU-specific AMD/ATI binaries using CAL, or
GPU binaries for CUDA-enabled NVIDIA architectures.
|
Compilation Program
An OpenCL application consists of a host program
(C/C++)
and an optional kernel program
(.cl).
To compile an OpenCL application, the host program
must be compiled, and this can be done using an off-the-shelf compiler such as
g++
or
MSVC++.
The application kernels are compiled into device-specific binaries using the OpenCL compiler.
The compiler uses a standard C front-end as well as the LLVM framework, with extensions for OpenCL.
Compiling OpenCL applications on Windows requires Visual Studio 2008 Professional Edition
or the Intel C compiler, and all C++ files must be compiled with appropriate settings. Compiling
OpenCL applications on Linux requires that
gcc or the Intel C compiler is
installed, and all C++ files must be compiled with appropriate settings on 32-bit/64-bit systems.
The OpenCL library and runtime environment depend upon the target GPU (i.e. CUDA-enabled NVIDIA or the AMD ATI Stream SDK).
For more control over the process of compiling and linking OpenCL programs,
you should use a Makefile. The user has to specify the name of the program and the correct
available platform on the system in the Makefile.
To compile an OpenCL program, type the command
make
The appropriate OpenCL program will be compiled and the executable installed in "/bin".
Running Program
Once an OpenCL application is compiled on the target system, the runtime system assigns the work in the command queues
to the underlying devices. Commands are placed into the queue using the
clEnqueue
commands
shown below. The commands can be broadly classified into three categories:
-
Kernel commands (for example,
clEnqueueNDRangeKernel(), etc.),
-
Memory commands (for example,
clEnqueueReadBuffer(), etc.), and
-
Event commands (for example,
clEnqueueWaitForEvents(), etc.).
|
An OpenCL application can create multiple command queues; please refer to the OpenCL specification, the
OpenCL Programming Guide for the CUDA Architecture, or the AMD ATI Stream Computing OpenCL Programming Guide.
To execute an OpenCL program, change the directory to the "/bin" directory &
give the name of the executable at command prompt.
$ cd ./bin
$ . / < Name of the Executable >
Example :
$ ./PrefixSum
Read the instructions displayed on screen to execute the programs.
NOTE :
For Compilation & execution of the OpenCL program : OpenCL Driver should be installed
on the system properly. OpenCL Library file path should be included in
environment variable “LD_LIBRARY_PATH”.
|
Simple Vector Vector Addition Program
|
The simple vector-vector addition example demonstrates the simplest
approach for an OpenCL implementation, in which an appropriate
platform is chosen. The
main() function is
reproduced along with the kernel source code. The
main()
function of the
vector-vector addition example either
implements or calls the
functions that perform the following operations:
Create an OpenCL context on the first available device
Create a command-queue on the first available device
Load a kernel file ( vect-vect-addition.cl ) and build it into a program object
Create a kernel object for the kernel function vect-vect-addition()
Queue the kernel for execution
Read the results of the kernel back into the result buffer
|
Example Program : Vector-Vector Addition OpenCL Kernel & Main Function
|
VectVectAddition.cl :
__kernel void VectVectAddition_kernel(
__global const float *a,
__global const float *b,
__global float *result)
{
int gid = get_global_id(0);
result[gid] = a[gid] + b[gid];
}
//
// Cleanup any created OpenCL resources
//
void Cleanup(
cl_context context,
cl_command_queue commandQueue,
cl_program program,
cl_kernel kernel,
cl_mem memObjects[3] )
{
for(int i = 0; i < 3; i++)
{
if (memObjects[i] != 0)
clReleaseMemObject(memObjects[i]);
}
if (commandQueue != 0)
clReleaseCommandQueue(commandQueue);
if (kernel != 0)
clReleaseKernel(kernel);
if (program != 0)
clReleaseProgram(program);
if (context != 0)
clReleaseContext(context);
}
VectVectAddition.cpp:
int main(int argc, char** argv) {
cl_context context = 0;
cl_command_queue commandQueue = 0;
cl_program program = 0;
cl_device_id device = 0;
cl_kernel kernel = 0;
cl_mem memObjects[3] = { 0, 0, 0 };
cl_int errNum;
// Create an OpenCL context on first available platform
context = CreateContext();
if(context == NULL)
{
cerr << "Failed to create OpenCL context." << endl;
return 1;
}
// Create a command-queue on the first device available
// on the created context
commandQueue = CreateCommandQueue(context, &device);
if(commandQueue == NULL)
{
Cleanup(context, commandQueue, program, kernel,
memObjects);
return 1;
}
// Create OpenCL program from VectVectAddition.cl kernel source
program =
CreateProgram(context, device, "VectVectAddition.cl");
if (program == NULL)
{
Cleanup(context, commandQueue, program, kernel,
memObjects);
return 1;
}
// Create OpenCL kernel
kernel =
clCreateKernel(program, "VectVectAddition_kernel", NULL);
if(kernel == NULL)
{
cerr << "Failed to create OpenCL kernel." << endl;
return 1;
}
// Create memory objects that will be used as arguments to
// kernel. First create host memory arrays that will be
// used to store the arguments to the kernel
float result[ARRAY_SIZE];
float a[ARRAY_SIZE];
float b[ARRAY_SIZE];
//Fill Arrays 'a' and 'b' on the host
for(int i = 0; i < ARRAY_SIZE; i++) {
a[i] = i;
b[i] = i+2;
}
// Copy the arrays into memory objects that will be passed
// to this kernel
if (!CreateMemObjects(context, memObjects, a, b) )
{
Cleanup(context, commandQueue, program, kernel,
memObjects);
return 1;
}
// Set the kernel arguments (result, a, b)
errNum = clSetKernelArg(kernel, 0, sizeof(cl_mem), &memObjects[0]);
errNum |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &memObjects[1]);
errNum |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &memObjects[2]);
if(errNum != CL_SUCCESS)
{
cerr << "Error setting kernel arguments." << endl;
Cleanup(context, commandQueue, program, kernel,
memObjects);
return 1;
}
size_t globalWorkSize[1] = { ARRAY_SIZE };
size_t localWorkSize[1] = { 1 };
// Queue the kernel up for execution across the array
errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
globalWorkSize, localWorkSize,
0, NULL, NULL);
if(errNum != CL_SUCCESS)
{
cerr << "Error queuing kernel for execution." << endl;
Cleanup(context, commandQueue, program, kernel,
memObjects);
return 1;
}
// Read the output buffer back to the Host
errNum = clEnqueueReadBuffer(commandQueue, memObjects[2],
CL_TRUE, 0, ARRAY_SIZE * sizeof(float),
result, 0, NULL, NULL);
if(errNum != CL_SUCCESS)
{
cerr << "Error reading result buffer." << endl;
Cleanup(context, commandQueue, program, kernel,
memObjects);
return 1;
}
// Output the result buffer
for(int i = 0; i < ARRAY_SIZE; i++) {
cout << result[i] << " ";
}
cout << endl;
cout << "Executed program successfully." << endl;
Cleanup(context, commandQueue, program, kernel,
memObjects);
return 0;
}
|
- Objective
Write OpenCL program to calculate multiplication of scalar value and vector
- Description
We create a one-dimensional globalWorkSize array that is overlaid on the vector.
The input vector, using single-precision or double-precision data, is generated on the
host-CPU, and the vector is transferred to the device-GPU for scalar-vector multiplication.
A simple kernel based on a one-dimensional index space of
work-groups is used, in which each work-item is given a unique ID within its work-group
and operates on data in global memory.
The final resultant vector is generated on the device and transferred to the host.
This code demonstrates the development of an OpenCL kernel for simple computations:
each work-item multiplies one element of the vector by the scalar, using its work-item
ID.
-
The choice of work-items in the code are given as
size_t globalWorkSize[3] = [128, 1,1];
Using local memory and choice of multiple work-items in the
same workgroup may increase the performance. It is possible to achieve this
with an explicit assignment from a
global memory pointer as a local memory pointer. The below table indicates the
important steps to be performed for exeuciton on device.
|
A brief summary of the operations required for the OpenCL program.
Steps
|
Description
|
1.
|
Memory allocation on host and input data generation :
Allocate memory on the host-CPU
and fill it with single- or double-precision data.
|
2.
|
Set OpenCL execution environment :
Call the function setExeEnv, which sets the execution environment for OpenCL.
The function performs the following :
(a). Discover & initialise the platforms;
(b). Discover & initialise the devices;
(c). Create a context for the GPU devices to be used; and
(d). Create a program object and build the program executable from the program source.
|
3.
|
Create a command queue
using clCreateCommandQueue(*)
and associate it with the device you want to execute on.
|
4.
|
Create a device buffer
using the
clCreateBuffer() API
that will contain the data from
the host buffer.
|
5.
|
Write host-CPU data to device buffers
|
6.
|
Kernel launch :
(a). Create the kernel handle;
(b). Set the kernel arguments;
(c). Configure the work-item structure (
define the global and local work sizes for launching the kernel on the device-GPU); and
(d). Enqueue the kernel for execution
|
7.
|
Read the output buffer back to the host (copy the result from
the device-GPU to the host-CPU) :
Use the
clEnqueueReadBuffer() API.
|
8.
|
Check correctness of result on host-CPU
Perform computation on host-CPU and compare CPU and GPU results.
|
9.
|
Release OpenCL resources (Free the memory)
Free the memory of arrays allocated on host-CPU & device-GPU
|
Kernel execution on a device : Work-groups & work-items :
S.No.
|
Description
|
1.
|
Example :
Assume that 1024 elements are taken in each vector. The size can
be specified as a one-, two-, or three-dimensional
vector. The host code to specify an NDRange for 1024 elements is as follows :
size_t indexSpaceSize[3] = {1024, 1, 1};
Most importantly, the work-items of an NDRange are divided into
smaller, equal-sized work-groups as shown in the Figure.
|
2.
|
Example :
Perform barrier operations & synchronization.
Work-items within a work-group can perform barrier
operations to synchronize, and they have access to a shared memory address space.
Because work-group sizes are fixed, this communication does not
need to scale and hence does not affect the scalability of a large
concurrent dispatch.
For Example 1.3, i.e., vector-vector addition, the work-group size
can be specified as
size_t workGroupSize[3] = {64, 1, 1};
If the total number of work-items per array is 1024, this results
in the creation of 16 work-groups (1024 work-items / 64 work-items
per work-group).
Most importantly, OpenCL requires that the index space sizes are evenly
divisible by the work-group sizes in each dimension.
For hardware efficiency, the work-group size is usually fixed
to a favourable size, and we round up the index space size in each
dimension to satisfy this divisibility requirement.
In the kernel code, the user can specify that the extra work-items in
each dimension simply return immediately without outputting any data.
For many highly data-parallel computations (for example, vector-vector
addition), OpenCL allows the local work-group size to be ignored by
the programmer and generated automatically by the implementation; in this
case, the developer passes NULL instead.
|
|
OpenCL Code : Multiplication of Scalar with Vector
The examples illustrate how to use the OpenCL APIs to
execute a kernel on a device, and the algorithms that are used in numerical computations. The examples
should not be considered examples of performance tuning of OpenCL kernels
on target systems. Selected example programs will be made available during the laboratory session.
|
Example Program : OpenCL (html)
|
The first OpenCL program performs vector-vector multiplication. It illustrates
the basic programming steps with the required amount of code. Error
checks are incorporated after each OpenCL call. This code can be generalized.
// Info on Header files / definitions /Error Checks
#include <CL/cl.h>
#include <iostream>
#include <fstream>
#include <stdlib.h>
#include <cstring>
#include <math.h>
using namespace std;
#define KERNEL_SOURCE_PATH "VectVectMult_kernel.cl"
#define GLOBAL_WORK_SIZE 4 // number of global items in work dimension
#define LOCAL_WORK_SIZE 4 // number of work items per group
// Error checking after each OpenCL call
#define STATUSCHKMSG(x) \
if(status != CL_SUCCESS) \
{ cout << "\n Operation is not successful : "; \
cout << x << " \n"; exit(1); }
// Subroutine to read OpenCL kernel source.
|
Read the kernel source code from the specified kernel source file. The kernel source code is
compiled, built & linked dynamically at runtime.
" KERNEL_SOURCE_PATH " is a user-defined
macro, which defines the physical path to the kernel source file. " readKernelSource " is
a function that reads the kernel source and puts it into a character string.
Before an OpenCL kernel can be launched, its program source is compiled, and a handle to the
kernel is created. The string is used to build the kernel using the following APIs.
clCreateProgramWithSource()
clBuildProgram()
clCreateKernel()
|
// @param[in,out] path |
Path to the OpenCL kernel source file
|
// return |
Return a character string representing OpenCL Kernel source code
|
// Subroutine to read kernel source code into character string
char* readKernelSource(char* path)
{
int srcLen;
char* sProgramSource;
ifstream srcFile;
srcFile.open(path, ifstream::in);
srcFile.seekg(0, ios::end);
srcLen = srcFile.tellg();
srcFile.seekg(0, ios::beg);
// allocate one extra byte for the terminating '\0' (strlen() is used later)
sProgramSource = (char*) malloc((srcLen + 1) * sizeof(char));
srcFile.read(sProgramSource, srcLen);
sProgramSource[srcLen] = '\0';
srcFile.close();
return sProgramSource;
} // end of readKernelSource
// Subroutine to Set OpenCL programming environment and build kernel
|
The setExeEnv() function performs the OpenCL context and other platform-related setup : platform identification, selection of the target device, setting up the device context, calling the "readKernelSource" function to read the kernel source file into a character string, creating the program and building the kernel source code, and binding the kernel binary object with the selected context. Please refer to the setExeEnv() function for more detail.
|
// @param[in,out] context |
Handle to current execution context
|
// @param[in,out] deviceListSize |
Hold device list length.
|
// @param[in,out] devices |
Handle to list of devices.
|
// @param[in,out] queue |
Handle to command queue to currently used context with specific device.
|
// @param[in,out] hProgram |
Handle to kernel source program
|
// @param[in,out] path |
Relative path to kernel source code
|
// @param[in,out] deviceType |
Targeted device type.
|
//@ return |
Returns nothing (void) on successful execution
|
// Setup function for the OpenCL programming environment
void setExeEnv ( cl_context *context,
size_t *deviceListSize,
cl_device_id **devices,
cl_command_queue *queue,
cl_program *hProgram,
char* path,
cl_device_type deviceType)
{
cl_int status = CL_SUCCESS;
// |
Querying Platform Info : Obtain the list of platforms available:
The function
clGetPlatformIDs (*,*,*)
gives list of platforms;
num_platform_entries is the number of cl_platform_id entries that can be added to platforms; the
number of platforms found is returned in num_Platforms. If platforms is not NULL,
num_platform_entries must be greater than zero.
|
const cl_uint num_platform_entries = 10;
cl_uint num_Platforms;
cl_platform_id platforms[num_platform_entries];
status = |
clGetPlatformIDs(num_platform_entries, platforms,
&num_Platforms);
|
STATUSCHKMSG("clGetPlatformIDs Failed ");
// |
Querying Platform Info : Get specific information about the OpenCL
platform.
Platform_Buffer[] receives the queried vendor string; its size in bytes is passed to the call.
|
cl_platform_id platform;
cout << " Available OpenCL Platforms : \n";
for( unsigned i = 0; i < num_Platforms; ++i)
{
char Platform_Buffer[100];
status = |
clGetPlatformInfo( platforms[i], CL_PLATFORM_VENDOR,
sizeof(Platform_Buffer), Platform_Buffer, NULL);
|
STATUSCHKMSG("clGetPlatformInfo Failed");
if( (!strcmp(Platform_Buffer, "Advanced Micro Devices, Inc.")) || (!strcmp(Platform_Buffer,"NVIDIA Corporation")))
{
platform = platforms[i];
cout << "\t" << i+1 << " ) " << Platform_Buffer;
break;
}
}
// |
Contexts : Create an OpenCL context from a device type.
The function
clCreateContextFromType(*,*,*,*,*)
creates an OpenCL context from a
device type that identifies the specific device(s) to use. cprops
specifies a list of context property names and their corresponding values.
|
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform,0};
cl_context_properties* cprops = ( NULL == platform ) ? NULL : cps;
*context = |
clCreateContextFromType(cprops,deviceType,
NULL,NULL,&status);
|
STATUSCHKMSG(" context ");
// |
Contexts : get size of device list data
The function
clGetContextInfo (*,*,*,*,*)
can be used to query information about a context.
context specifies the OpenCL context being queried.
|
status = |
clGetContextInfo(*context, CL_CONTEXT_DEVICES, 0,NULL,
deviceListSize);
|
STATUSCHKMSG(" Device list");
// |
Pointer to memory object which holds the device info
|
(*devices) = (cl_device_id *) malloc ( *deviceListSize);
if( *devices == NULL ){
cout<<" Failed to allocate memory ";
exit(-1);
}
// |
Contexts : get device list info using clGetContextInfo(...)
|
status = |
clGetContextInfo(*context, CL_CONTEXT_DEVICES, *deviceListSize, *devices, NULL);
|
STATUSCHKMSG("device info");
// |
Command-queues : used to queue a set of operations
The function
cl_command_queue
clCreateCommandQueue
creates a command-queue on a specific device.
context must be a valid OpenCL context; device must be a device associated with
context
|
*queue = |
clCreateCommandQueue( *context, (*devices)[0], 0, &status);
|
STATUSCHKMSG("command queue");
// |
Querying Device Info :
Identify the device type.
The clGetDeviceInfo
gets specific information about an OpenCL device.
|
size_t retInfoSize;
cl_device_type inputDeviceType;
clGetDeviceInfo((*devices)[0], CL_DEVICE_TYPE, sizeof(cl_device_type), &inputDeviceType, &retInfoSize);
if( inputDeviceType == CL_DEVICE_TYPE_GPU )
cout<<"\n \n Device Type : CL_DEVICE_TYPE_GPU";
else if ( inputDeviceType == CL_DEVICE_TYPE_CPU)
cout<<" \n \n Device Type : CL_DEVICE_TYPE_CPU";
// |
Program Objects : Create CL program using kernel source.
The function
clCreateProgramWithSource
creates a program object for a context and loads the source code specified by the text strings in
the strings array into the program object. The devices associated with the program object are the
devices associated with context .
context must be a valid OpenCL context.
A program object encapsulates the following information : an associated context;
the program source or binary;
the latest successfully built program executable; the list of devices for which the
program executable is built; the build options used and a build log; and
the number of kernel objects currently attached.
|
const char* sProgramSource = readKernelSource(path);
size_t sourceSize[] = { strlen(sProgramSource) };
*hProgram = |
clCreateProgramWithSource(*context, 1, &sProgramSource, sourceSize, &status);
|
STATUSCHKMSG("create source handle");
// |
Build Program : Building Program Executables .
The function
clBuildProgram (*,*,*,*,*,*)
builds (compiles & links) a program executable from the program source or binary for all the
devices or a specific device(s) in the OpenCL context associated with program. OpenCL allows
program executables to be built using the source or the binary.
clBuildProgram must be called
for program created using either clCreateProgramWithSource or
clCreateProgramWithBinary to build the program executable for one or more devices
associated with program
|
status = clBuildProgram( *hProgram,1, (*devices), NULL,NULL,NULL);
STATUSCHKMSG("build");
}// end of setExeEnv
// Subroutine Input Array : Initialize input vectors with random input values
// @param[in] length |
Size of the input array
|
// @param[out] Input Array |
Input array to be filled with random integer (cl_int) values.
|
void fillInArray(cl_int *hInArray, size_t length)
{
for(size_t count=0; count< length; count++)
hInArray[count] = rand()%10;
} // end of fillinArray
// Subroutine Input Program Data : Parse command line arguments
// to get the program input
// @param[in] argc
|
Number of command line argument.
|
// @param[out] argv
|
Command line arguments.
|
// @param[in,out] len
|
Pointer to input variable "length of vector"
|
// @param[in,out] deviceType |
Pointer to device type specification.
|
void readInput(int argc, char* argv[], size_t *len, cl_device_type *deviceType)
{
if( argc != 3 || (strcmp("GPU", argv[2]) != 0 && strcmp("CPU", argv[2]) != 0))
{
cout << "\n Error : Invalid input arguments"
<< "\n Syntax : ./<executable> <vector length> <CPU|GPU>"
<< "\n Example : ./VectVectMult 8 GPU \n";
exit(-1);
}
else
{
if(!strcmp("GPU",argv[2])){
(*deviceType) = CL_DEVICE_TYPE_GPU;
}
else
{
(*deviceType) = CL_DEVICE_TYPE_CPU;
}
(*len) = atoi(argv[1]);
}
} // end of readInput
// Main Program Vector-Vector Multiplication
int main(int argc, char* argv[])
{
cl_device_type deviceType;
cl_int status = CL_SUCCESS;
size_t length;
// CALL function readInput() to get the program input
readInput(argc,argv,&length, &deviceType);
// Variable declaration
cl_context context;
size_t deviceListSize;
cl_device_id* devices;
cl_command_queue queue;
// CALL function setExeEnv() to set Prog. Env. and build kernel
char path[100];
strcpy(path,KERNEL_SOURCE_PATH);
cl_program hProgram;
cout<< "\n---------------------------------------------------\n";
setExeEnv(&context, &deviceListSize, &devices, &queue, &hProgram, path, deviceType);
cout<<"\n---------------------------------------------------\n";
// |
Kernel Objects : Create kernel handle
To create a kernel handle, use the function clCreateKernel (cl_program program,
const char *kernel_name,cl_int *errcode_ret)
program (hProgram) is a program object with a successfully built executable.
kernel_name (vectVectMult_kernel) is a function name in the program declared with the __kernel qualifier.
A kernel is a function declared in a program. A kernel is identified by the __kernel qualifier
applied to any function in a program. A kernel object encapsulates the specific __kernel
function declared in a program and the argument values to be used when executing this
__kernel function.
|
cl_kernel vectVectMult_kernel;
vectVectMult_kernel = |
clCreateKernel( hProgram, "vectVectMult_kernel", &status);
|
STATUSCHKMSG("kernel handle");
// |
Memory Objects : Create Memory Objects to hold input , output on host / device
A buffer object is created using the function
cl_mem clCreateBuffer(*,*,*,*,*)
Elements in a buffer are stored in sequential fashion and can be accessed using a pointer
by a kernel executing on a device.
|
// Create input vector 1 on - Host : hInVectOne / Device : dInVectOne
cl_int *hInVectOne;
hInVectOne = new cl_int[length];
fillInArray(hInVectOne, length);
cl_mem dInVectOne = |
clCreateBuffer(
context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
length * sizeof(cl_int), (void*) hInVectOne, &status);
|
STATUSCHKMSG("memory allocation Vect one");
// Create input vector 2 on - Host : hInVectTwo / Device :
// dInVectTwo
cl_int *hInVectTwo;
hInVectTwo = new cl_int[length];
fillInArray(hInVectTwo, length);
cl_mem dInVectTwo = |
clCreateBuffer(
context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, length * sizeof(cl_int), (void*)hInVectTwo, &status);
|
STATUSCHKMSG("memory allocation vect two");
// Create space for holding vector length : Host & Device
cl_int hLength = length;
cl_mem dLength = |
clCreateBuffer(
context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(cl_int),(void*)&hLength, &status);
|
STATUSCHKMSG("scalar memory setting");
// Create space for output on - Host : hOutScalar
// Device : dOutScalar
cl_int *hOutScalar;
hOutScalar = new cl_int[1];
cl_mem dOutScalar = |
clCreateBuffer(
context, CL_MEM_WRITE_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(cl_int),(void*)hOutScalar, &status);
|
STATUSCHKMSG("o/p memory allocation");
// |
Kernel Objects : Setting Kernel Arguments :
To execute a kernel, the kernel arguments must be set.
The function
cl_int clSetKernelArg (*,*,*,*)
is used to set the argument value for a specific argument of a kernel.
kernel is a valid kernel object.
|
//set input vector argument
status = |
clSetKernelArg(vectVectMult_kernel, 0, sizeof(cl_mem), (void*) &dInVectOne);
|
STATUSCHKMSG("in arg setting vectone");
//set second input vector argument
status = |
clSetKernelArg(vectVectMult_kernel, 1, sizeof(cl_mem), (void*) &dInVectTwo);
|
STATUSCHKMSG("in arg setting vecttwo");
//set length of two input vector argument
status = |
clSetKernelArg(vectVectMult_kernel, 2, sizeof(cl_mem), (void*) &dLength);
|
STATUSCHKMSG("scalar value argument length");
//set output scalar value argument
status = |
clSetKernelArg(vectVectMult_kernel, 3, sizeof(cl_mem), (void*) &dOutScalar);
|
STATUSCHKMSG("scalar value argument output value");
//set space for temporary value argument
status = |
clSetKernelArg(vectVectMult_kernel, 4, sizeof(int), NULL);
|
STATUSCHKMSG("scalar value argument temp variable");
// Enqueue/Launch kernel
// number of global items in work dimension
size_t globalThreads[] = { GLOBAL_WORK_SIZE };
// number of work items per group
size_t localThreads[] = { LOCAL_WORK_SIZE};
// |
Enqueue Kernel :
The runtime system assigns the work in the command queues to the underlying
devices. Commands are placed into the queue using the clEnqueue* calls.
The commands can be broadly classified into three categories :
Kernel commands (for example,
clEnqueueNDRangeKernel(),
etc.),
Memory commands (for example, clEnqueueReadBuffer(), etc.), and
Event commands (for example, clEnqueueWaitForEvents(), etc.).
The function
clEnqueueNDRangeKernel()
enqueues a command to execute a kernel on a device.
|
status = |
clEnqueueNDRangeKernel( queue, vectVectMult_kernel, 1,
NULL, globalThreads, localThreads,0,NULL,NULL);
|
STATUSCHKMSG("kernel enqueue");
// Wait for kernel execution to finish
status = clFinish(queue);
STATUSCHKMSG("clFinish");
cl_event events[1];
// Read output result from device to host
status = |
clEnqueueReadBuffer(queue, dOutScalar, CL_TRUE, 0,
sizeof(cl_int), hOutScalar, 0, NULL, &events[0]);
|
STATUSCHKMSG("read output");
// Wait for read buffer to complete the read of output produced by kernel
status = clWaitForEvents(1, &events[0]);
STATUSCHKMSG("read event not completed");
// Print the Input & Output vectors
cout<<"\n Input Vector One : \n";
for(size_t count = 0; count < length; count++)
{
cout << " " << hInVectOne[ count ];
}
cout << "\n Input Vector Two \n";
for(size_t count = 0; count < length; count++)
{
cout << " " << hInVectTwo[ count ];
}
cout <<"\n Output : "<< (*hOutScalar) << "\n";
// Releasing memory objects.
clReleaseMemObject(dInVectOne);
clReleaseMemObject(dInVectTwo);
clReleaseMemObject(dLength);
clReleaseMemObject(dOutScalar);
clReleaseKernel(vectVectMult_kernel);
clReleaseProgram(hProgram);
clReleaseCommandQueue(queue);
clReleaseContext(context);
delete [] hInVectOne;
delete [] hInVectTwo;
delete [] hOutScalar;
} // end of main
// VectVectMult_kernel.cl : Function __kernel void vectVectMult_kernel
// |
Kernel: A kernel is a function declared in a program and executed on an OpenCL device. A
kernel is identified by the __kernel qualifier applied to any function defined in a program.
OpenCL kernel implementation of vector-vector multiplication (dot product).
|
// @param[in] inVectOne |
Handle to input vector one
|
// @param[in] inVectTwo |
Handle to input vector two
|
// @param[in] length |
Handle to variable defining the length of the input vectors
|
// @param[out] outScalar |
Handle to output scalar variable.
|
// @param[in] tempScalar |
Handle to temporary scalar variable.
|
__kernel void vectVectMult_kernel( |
__global int *inVectOne, __global int *inVectTwo,
__global int *length, __global int *outScalar,
__local int *tempScalar)
|
// Synchronization functions
|
All work-items in a work-group executing the kernel
on a processor must execute this function before any
are allowed to continue execution beyond the barrier.
This function must be encountered by all work-items in
a work-group executing the kernel.
CLK_LOCAL_MEM_FENCE : The barrier function will either
flush any variables stored in local memory or queue a memory fence to ensure
correct ordering of memory operations to local memory.
|
{
// get unique global work item ID
unsigned int gid = get_global_id(0);
//get the number of global work items
unsigned int global_work_items = get_global_size(0);
// accumulate a private partial sum; accumulating through tempScalar,
// which all work-items in the group share, would be a data race
int partialSum = 0;
for(int currCell = gid; currCell < (*length); currCell += global_work_items )
{
partialSum = partialSum + inVectOne[ currCell ] * inVectTwo[ currCell ];
}
inVectTwo[gid] = partialSum;
barrier(CLK_LOCAL_MEM_FENCE);
if( gid == 0 )
{
*outScalar = 0;
for( int cellInd = 0; cellInd< global_work_items ; cellInd++)
{
*outScalar = (*outScalar) + inVectTwo[cellInd];
}
}
} // end vectVectMult_kernel
|