hyPACK-2013 Mode 2 : OpenCL Prog. on CPUs & GPUs
OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. It is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors, and it gives developers an opportunity to use these multiple heterogeneous compute resources effectively.
OpenCL supports a wide range of applications, from embedded and consumer software to HPC solutions, through a low-level,
high-performance, portable abstraction. It is expected that OpenCL will form the foundation layer of a parallel
programming eco-system of platform-independent tools, middleware, and applications.
OpenCL is being created by the Khronos Group with the participation of many industry-leading companies such as AMD, Apple, IBM, Intel, Imagination Technologies, Motorola, NVIDIA and others.
As a cross-platform standard for the modern processors found in personal computers, servers
and handheld/embedded devices, OpenCL greatly improves speed and responsiveness for a wide spectrum of applications
in areas ranging from gaming and entertainment to scientific and medical software.
OpenCL 1.2
Khronos released the OpenCL 1.2 specification at SC11 in November 2011. The new version provides enhanced performance and functionality for
parallel programming in a backwards-compatible specification that is the
result of cooperation by over thirty industry-leading companies. Khronos has
updated and expanded its comprehensive OpenCL conformance test suite to ensure that
implementations of the new specification provide a complete and reliable platform for
cross-platform application development.
OpenCL 1.2 enables significantly enhanced parallel programming flexibility, functionality
and performance through many updates and additions including:
- Device partitioning - enabling applications to partition a device into sub-devices
to directly control work assignment to particular compute units, reserve a part of the
device for high-priority/latency-sensitive tasks, or effectively use shared
hardware resources such as a cache;
- Separate compilation and linking of objects - providing the capabilities and
flexibility of traditional compilers, enabling the creation of libraries of OpenCL
programs for other programs to link to;
- Enhanced image support - including added support for 1D images and 1D & 2D
image arrays. Also, the OpenGL sharing extension now enables an OpenCL image to be
created from OpenGL 1D textures and 1D & 2D texture arrays;
- Built-in kernels - representing the capabilities of specialized or non-programmable hardware
and associated firmware, such as video encoders/decoders and digital signal processors,
enabling these custom devices to be driven from and integrated closely with the
OpenCL framework.
The OpenCL Architecture
OpenCL provides a uniform programming environment for software developers to write efficient, portable code for
high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs,
GPUs, Cell-type architectures and other parallel processors such as DSPs. OpenCL is a framework for parallel programming
and includes a language, API, libraries, and runtime system to support software development.
Using OpenCL, a programmer can write general-purpose programs that execute on GPUs without the need to map their algorithms onto
a 3D graphics API such as OpenGL.
The OpenCL architecture is described by a hierarchy of models: the Platform Model, the Memory Model,
the Execution Model, and the Programming Model. The important points are described below.
Platform Model
- The Platform model consists of a host connected to one or more OpenCL devices.
- An OpenCL device is divided into one or more compute units (CUs), which are further divided into one or
more processing elements (PEs). Computations on a device occur within the processing elements.
- An OpenCL application runs on a host according to the models particular to the host platform.
- The OpenCL application submits commands from the host to execute computations on the processing
elements within a device.
- The processing elements within a compute unit execute a single stream of instructions as SIMD units or as SPMD units.
Execution Model
- Execution of an OpenCL program occurs in two parts: a host program that executes on the host
platform, and kernels that execute on one or more OpenCL devices.
- The core of the OpenCL execution model is defined by how the kernels execute. An instance of a kernel is called a
work-item, and work-items are organised into work-groups.
- Context and Command Queues: the host defines a context for the execution of the kernels.
The context includes
Devices : the collection of OpenCL devices to be used by the host;
Kernels : the OpenCL functions that run on OpenCL devices;
Program Objects : the program source and executables that implement the kernels;
Memory Objects : a set of memory objects visible to the host and the OpenCL devices; memory objects contain values that
can be operated on by instances of a kernel.
- The OpenCL execution model supports two categories of kernels: OpenCL kernels and native kernels.
The OpenCL Framework
The OpenCL Platform Layer
The OpenCL Runtime
- Command Queues
- Memory Objects
  - Creating Buffer Objects
  - Reading, Writing and Copying Buffer Objects
  - Retaining and Releasing Memory Objects
  - Creating Image Objects
  - Querying the List of Supported Image Formats
  - Reading, Writing and Copying Image Objects
  - Copying between Image and Buffer Objects
  - Mapping and Unmapping Memory Objects
  - Memory Object Queries
- Sampler Objects
- Program Objects
  - Creating Program Objects
  - Building Program Executables
  - Build Options (Preprocessor, Math Intrinsics, Optimization)
  - Unloading the OpenCL Compiler
  - Program Object Queries
- Kernel Objects
  - Creating Kernel Objects
  - Setting Kernel Arguments
  - Kernel Object Queries
- Executing Kernels
- Event Objects
- Profiling Operations on Memory Objects and Kernels
- Flush and Finish
The OpenCL Compiler : Building & Running Programs
The compiler tool-chain provides a common framework for both CPUs and GPUs, sharing the front-end and some high-level
compiler transformations. The back-ends are optimized for the device type (CPU or GPU). Most of the application remains
the same, but OpenCL APIs are included at various parts of the code. The kernels are compiled by the OpenCL compiler to
either CPU binaries or GPU binaries, depending on the target device.
- CPU Processing : for CPU processing, the OpenCL runtime uses LLVM (Low Level Virtual Machine)
to generate x86 binaries. The OpenCL runtime automatically determines the number of processing elements, or cores, present in the
CPU and distributes the OpenCL kernel between them.
- GPU Processing : for GPU processing, the OpenCL runtime layer generates either GPU-specific AMD-ATI binaries using CAL, or
GPU binaries for CUDA-enabled NVIDIA architectures.
Compiling Program
- An OpenCL application consists of a host program (C/C++) and an optional kernel program (.cl).
To compile an OpenCL application, the host program must be compiled, and this can be done using an
off-the-shelf compiler such as g++ or MSVC++.
The application kernels are compiled into device-specific binaries using the OpenCL compiler.
The compiler uses a standard C front-end as well as the LLVM framework, with extensions for OpenCL.
- Compiling OpenCL applications on Windows requires Visual Studio 2008 Professional Edition
or the Intel C compiler, and all C++ files must be compiled with appropriate settings. Compiling
OpenCL applications on Linux requires that gcc or the Intel C compiler is
installed, and all C++ files must be compiled with appropriate settings on 32-bit/64-bit systems.
The OpenCL library and runtime environment depend upon the target GPU (i.e. CUDA-enabled NVIDIA or AMD ATI Stream SDK).
Running Program
Once an OpenCL application is compiled on the target system, the runtime system assigns the work in the command queues
to the underlying devices. Commands are placed into the queue using the clEnqueue commands
shown below. The commands can be broadly classified into three categories:
- Kernel commands (for example, clEnqueueNDRangeKernel(), etc.),
- Memory commands (for example, clEnqueueReadBuffer(), etc.), and
- Event commands (for example, clEnqueueWaitForEvents(), etc.).
An OpenCL application can create multiple command queues; please refer to the OpenCL specification, the
OpenCL Programming Guide for the CUDA Architecture, or the AMD ATI Stream Computing OpenCL Programming Guide.
The OpenCL Software & Programming Model
An Overview - Data & Task Programming Models :
Two commonly used parallel programming models are implicit and explicit. The most popular approach to implicit parallelism is the
automatic parallelization of sequential programs by compilers; such compilers reduce the burden on the programmer of explicitly parallelizing
the program. Three explicit parallel programming models are data-parallel, shared-variable and message passing.
The data-parallel model applies to either SIMD or SPMD modes. The idea is to execute the same instruction or program
segment over different data sets simultaneously on multiple computing nodes. In data parallelism, the data structure
is distributed among the processors, and the individual processes execute the same instructions on their parts of the data structure.
This approach is extremely well suited to SIMD machines. One of its most
attractive aspects is that for a very regular structure it is possible for the user program to simply indicate that the structure
should be distributed across the processes, and the compiler will automatically replace the user directive with code
that distributes the data and performs the data-parallel operations.
The task-parallel model applies to the many problems whose underlying task graph naturally contains a sufficient degree of concurrency.
Given such a graph, tasks can be scheduled on multiple processors to solve the problem in parallel. Unfortunately, there are many problems
for which the task graph consists of only one task, or of multiple tasks that need to be executed sequentially. For such problems,
we need to split the overall computation into tasks that can be performed concurrently. The process of splitting the computations in
a problem into a set of concurrent tasks is referred to as decomposition. A good decomposition should have a high degree of concurrency,
and the interaction among tasks should be as small as possible.
The OpenCL Data & Task Programming Model :
OpenCL supports data and task parallel programming models as well as hybrid programming models.
In the data-parallel programming model, a computation is defined in terms of a sequence of instructions executed on multiple elements of a
memory object. These elements exist in an index space, as explained in the execution model of OpenCL, which defines how the execution maps onto
the work-items. The OpenCL data-parallel programming model is hierarchical, and the hierarchical subdivision can be specified in two ways.
In explicit programming, the developer defines the total number of work-items to execute in parallel as well as the division of work-items into
specific work-groups.
In implicit programming, the developer specifies only the total number of work-items to execute in parallel, and OpenCL manages the division into
work-groups.
In the task-parallel programming model, a kernel instance is executed independently of any index space. This is equivalent to executing a kernel on a compute device with
a work-group and NDRange containing a single work-item. Parallelism is expressed using vector data types implemented by the device,
enqueuing multiple tasks, and/or enqueuing native kernels developed using a programming model orthogonal to OpenCL.
Hardware Overview:
The general block diagram of a generalized GPU compute device consists of a number of compute units, each of which contains a number of cores
responsible for executing kernels, each operating on an independent data stream. Different GPU compute devices have different
characteristics (NVIDIA, AMD). Each core contains numerous processing elements.
Programming Model:
The OpenCL programming model is based on the notion of a host device, supported by an application API, and a number of devices connected
through a bus, which are programmed using the OpenCL C language. The host API is divided into platform and runtime layers. OpenCL C
is a C-like language with extensions for parallel programming such as memory fence operations and barriers. The typical OpenCL model consists
of information such as queues of commands, reading/writing data, and executing kernels for specific devices.
The devices are capable of running data- and task-parallel work. A kernel can be executed as a function over a multi-dimensional index space.
Each element of this space is called a work-item, and the total number of indices is defined as the global work-size.
The global work-size can be divided into sub-domains, called work-groups, whose work-items share local memory. Work-items are synchronized through barrier or
fence operations.
OpenCL supports two domains of synchronization:
- Work-items in a single work-group,
- Commands enqueued to command-queue(s) in a single context.
How an OpenCL application is built :
- First, query the runtime to determine which platforms are present. Any number of
different OpenCL implementations can be installed on a single system.
- Create a context. An OpenCL context has associated with it a number of compute devices,
such as CPU or GPU devices.
Within a context, OpenCL guarantees a relaxed consistency between these devices.
This means that memory objects, such as buffers or images, are allocated per context, but changes made by one device are only
guaranteed to be visible to another device at well-defined synchronization points.
- OpenCL provides events, with the ability to synchronize on a given event to enforce the correct order of execution.
- Many operations are performed with respect to a given context; there are also many operations that are specific to a device.
For example, program compilation and kernel execution are done on a per-device basis.
- Performing work with a device, such as executing
kernels or moving data to and from the device's local memory, is done using a corresponding command queue.
- A command queue is associated with a single device and a given context; all work for a specific device is done through this
interface. Note that while a single command queue can be associated with only a single device, a device can have multiple command
queues. For example, it is possible to have one command queue for executing kernels and another command queue for managing data
transfers between the host and the device.
Most OpenCL programs follow the same pattern: given a specific platform, select a device or devices, create a context, allocate memory,
create device-specific command queues, and perform data transfers and computations.
Generally, the platform is the gateway to accessing specific devices; given these devices and a corresponding
context, the application is independent of the platform. Given a context, the application can:
- Create command queues
- Create programs to run on one or more associated devices
- Create kernels within those programs
- Allocate memory buffers or images, either on the host or on the device(s) (memory can be copied between the host and device)
- Write data to the device
- Submit the kernel (with appropriate arguments) to the command queue for execution
- Read data back to the host from the device.
The relationship between context(s), device(s), buffer(s), program(s), kernel(s), and command queue(s) is best seen by looking at
simple code.
An Overview of Basic Programming Steps :
The steps given below illustrate the basic programming required for a minimal amount of code.
Many test programs require similar steps; these steps do not include error checks.
1. The host program must select a platform, which is an abstraction for a given OpenCL implementation.
Implementations by multiple vendors can coexist on a host. The developer can use the
clGetPlatformIDs(..) API to get a platform.
2. A device id for GPU devices is requested. The developer can use the clGetDeviceIDs(..) API to
find a GPU device; a CPU device could be requested by using CL_DEVICE_TYPE_CPU instead.
The device can be a physical device, such as a given GPU, or an abstracted device, such as the collection of all CPU cores
on the host.
3. On the selected device, an OpenCL context is created. The developer can use the clCreateContext(..) API to
create a context. A context ties together a device, memory buffers related to
that device, OpenCL programs, and command queues. Note that buffers related to a device can reside on either the host or the device.
Many OpenCL programs have only a single context, program, and command queue.
The developer can use the clCreateCommandQueue(..) API to create a command queue.
4. Before an OpenCL kernel can be launched, its program source is compiled and a handle to the kernel is
created. The developer can use the clCreateProgramWithSource(..) API
to perform runtime source compilation and obtain the kernel entry point.
5. A memory buffer is allocated on the device as per program requirements.
The developer can use the clCreateBuffer(..) API to create a data buffer.
6. The kernel is launched. The developer can use the clEnqueueNDRangeKernel(..) API
to launch the kernel, and let OpenCL pick the local work size:
while it is necessary to specify the global work size, OpenCL can determine a good local work size for
the device. Since the kernel is launched asynchronously,
clFinish() is used to wait for completion.
7. The data is mapped to the host for examination. Calling clEnqueueMapBuffer(..)
ensures the visibility of the buffer on the host,
which in this case probably includes a physical transfer.
Alternatively, we could use clEnqueueReadBuffer(..),
which requires a pre-allocated host-side buffer.
OpenCL MAGMA
An Overview of MAGMA - OpenCL :
MAGMA (Matrix Algebra on GPU and Multicore Architectures) Version 1.2.0 provides OpenCL implementations of MAGMA's
one-sided dense matrix factorizations (LU, QR, and Cholesky).
The MAGMA research project addresses the complex challenges of
emerging hybrid environments with optimal software solutions,
combining the strengths of different algorithms within a single framework.
The OpenCL port (clMAGMA) extends MAGMA's support to include AMD GPUs.
Visit the MAGMA URL for more information.
The clMAGMA library relies on external dependencies,
in particular an optimized GPU OpenCL BLAS and CPU-optimized BLAS and LAPACK for
AMD hardware. The details of the implementation on AMD GPUs
can be found in the
AMD Accelerated Parallel Processing Math Libraries (APPML).
The OpenCL C Programming Language & Numerical Compliance
- The OpenCL C programming language is used to create kernels that are executed on OpenCL device(s). The OpenCL
C programming language is based on the ISO/IEC 9899:1999 C language specification with specific
extensions and restrictions. Please refer to the OpenCL Specification Version 1.0, Khronos
OpenCL Working Group.
- OpenCL C supports various data types, conversions & type casting, operators, vector operations,
address space qualifiers, image access qualifiers, function qualifiers, rules for the use of pointers (restrictions),
preprocessor directives and macros, attribute qualifiers, and built-in functions. Please refer to the OpenCL Specification
Version 1.0, Khronos OpenCL Working Group.
- OpenCL specifies the floating-point functionality that must be supported by all OpenCL devices for single-precision
floating-point numbers. Double-precision floating point is an optional extension. Please refer to the
OpenCL Specification Version 1.0, Khronos OpenCL Working Group.
List of Programs : OpenCL on Host-CPUs & GPUs (NVIDIA / AMD-APP)
The examples illustrate how to use the OpenCL APIs to
execute a kernel on a device, and cover algorithms that are used in numerical computations. The examples
should not be considered as examples of how to address performance tuning of OpenCL kernels
on target systems. Selected example programs will be made available during the laboratory session.
- OpenCL program to find the total number of work-items in the x- and y-dimensions of the
NDRange (assume that the OpenCL kernel is launched with a two-dimensional (2D) NDRange; use the
get_global_size(0) and get_global_size(1) APIs of OpenCL).
- OpenCL program to devise a query that returns the constant memory size supported by the device.
- OpenCL program to get the unique global index of each work-item, using the get_global_id(0) function of the
OpenCL API.
- Write an OpenCL program to measure the time taken, for different data sizes, to copy (blocking read)
data from the device memory to the pinned host memory using clEnqueueReadBuffer().
- Write an OpenCL program to measure the time taken, for different data sizes, to copy (write)
data from the pinned host memory to the device memory using clEnqueueWriteBuffer().
- Measure time for OpenCL (blocking or non-blocking) calls and kernel executions using either
CPU or GPU timers (OpenCL GPU timers or OpenCL events).
- Code to measure the effective bandwidth for a 1024x1024 matrix (single/double precision) using
FastPath and CompletePath, and measure the ratio to peak bandwidth.
- Write a program to calculate memory throughput using the OpenCL visual profiler.
- Analyze the differences between the calculated effective memory bandwidth and the memory throughput reported by the OpenCL
visual profiler.
- Write a code to analyze the performance of a highly data-parallel computation, such as matrix-matrix computation, on
GPUs in which each multiprocessor contains either 8,192 or 16,384 32-bit registers that are partitioned among
concurrent threads.
- OpenCL program to measure the highest bandwidth between host and device based on
page-locked or pinned memory.
- OpenCL program to measure the highest bandwidth between host and device based on page-locked or pinned memory
using blocking (synchronous transfer) clEnqueueReadBuffer() / clEnqueueWriteBuffer() calls.
- OpenCL program to measure the highest bandwidth between host and device based on page-locked or pinned memory
using a non-blocking write (asynchronous transfer) clEnqueueWriteBuffer() call with parameter CL_FALSE, and a blocking read from device to host using a clEnqueueReadBuffer() call with parameter CL_TRUE.
- OpenCL program on overlapping transfers and device computation, based on the oclCopyComputeOverlap SDK sample.
The SDK sample "oclCopyComputeOverlap" is devoted exclusively to the techniques required to
achieve concurrent copy and device computation, and demonstrates the advantages of this relative to purely synchronous operation.
- Write test suites focusing on performance and scalability analysis of different data sizes
in the different memory spaces, i.e. the 16 KB per-thread limit on local memory (OpenCL __private memory), 64 KB of constant memory (OpenCL __constant memory), 16 KB of shared memory (OpenCL __local memory), and either 8,192 or 16,384 32-bit registers per multiprocessor.
- Write a code showing how to use the default work-group size at compile time, the size of the work-group, the role of the
compiler in allocating the number of registers per work-item, and a sufficient number of wavefronts.
- OpenCL program for matrix-matrix computation with different partitions of the matrices for coalescing global
memory accesses: (a) a simple access pattern - the kth thread accesses the kth word in a segment, with the exception
that not all threads need to participate; (b) a sequential but misaligned access pattern (sequential threads in a half warp access memory that is not aligned with the segments); (c) effects of misaligned accesses; (d) strided access.
- Simple OpenCL program that computes a matrix-vector multiply, choosing the best-optimized NDRange in which
an optimized number of work-items is launched. Setting the right work size in clEnqueueNDRangeKernel(), with the number of work-items a multiple of the warp size (i.e. 32), can be explored.
- Simple OpenCL program that computes the product W of a width x height matrix
M by a vector V, in which global
memory accesses are coalesced and the kernel is rewritten to have each work-group,
as opposed to each work-item, compute elements of W. (Each work-item is now responsible for
calculating part of the dot product of V and a row of M and storing it to OpenCL local memory.)
- OpenCL parallel reduction (a) with shared memory, showing the effects of misaligned accesses and bank conflicts, (b)
without shared memory, and (c) with warp-based parallelism.
- OpenCL code for the matrix-matrix multiply C = AA^T based on (a) strided accesses to global memory, (b)
shared memory bank conflicts, and (c) an optimized version using coalesced reads from global memory.
- Example programs based on numerical linear algebra using AMD-APP OpenCL.
Example programs based on special classes of problems - dense & sparse matrix computations,
fast search algorithms, and partial differential equations (PDEs) - will be discussed using
AMD-APP OpenCL on the HPC GPU cluster.
Selected example programs on numerical and non-numerical computations
using the AMD APP SDK OpenCL.
Example programs based on AMD APP Aparapi data-parallel workloads in Java.
Implementation of string search algorithms -
CUDA/OpenCL-enabled NVIDIA GPUs and OpenCL AMD-APP GPUs of the HPC GPU cluster.
Solution of partial differential equations (Poisson equation in two-dimensional
and three-dimensional regions) by the finite element method (FEM) using
OpenCL on the HPC GPU cluster.
Implementation of matrix computations (iterative methods to solve the Ax=b system of linear equations)
on multi-GPUs.
OpenCL - Perform a sparse matrix-vector multiplication kernel.
AMD-APP (using CAL) OpenCL : Write a simple HelloCAL application program using CAL of AMD Accelerated
Parallel Processing technology.
AMD-APP : Write a direct memory access (DMA) code to move data between the system memory and GPU local
memory using calMemCopy of AMD APP.
AMD-APP : Write a code to use AMD-APP asynchronous operations using the CAL API, for an application that must
perform CPU computations in the application thread and also run another kernel on the AMD-APP stream processor.
AMD-APP : Multiple devices - write a matrix-vector multiplication based on a self-scheduling algorithm using AMD-APP CAL, as a CAL
application using multiple stream processors via calDeviceGetCount.
AMD-APP : Use the CAL API of AMD-APP and ensure
that application-created threads on the CPU are used to manage the communication with
individual stream processors.
AMD-APP : Write a matrix-matrix computation on a Hybrid Computing (HC) platform in which the
matrix-matrix multiplication is performed judiciously on the host CPU using the ACML math library of the AMD Opteron processor
and CAL-OpenCL on the AMD-ATI (GPU) APP.
AMD-APP : Obtain the maximum achievable matrix-matrix computation on a Hybrid Computing (HC) platform in which
the block matrix-matrix multiplication is performed judiciously using the ACML math library of the
AMD Opteron processor on the host CPU and a tiled, blocked matrix-matrix multiplication based on CAL-OpenCL
on the AMD-ATI (GPU) APP.
OpenCL implementation of the solution of partial differential equations by the finite difference method.
OpenCL implementation of image processing - edge detection algorithms.
OpenCL implementation of string search algorithms.
NVIDIA - Web Sites
1. NVIDIA Kepler Architecture
2. NVIDIA CUDA Toolkit 5.0 Preview Release, April 2012
3. NVIDIA Developer Zone
4. RDMA for NVIDIA GPUDirect coming in CUDA 5.0 Preview Release, April 2012
5. NVIDIA CUDA C Programming Guide, Version 4.2, 4/16/2012 (April 2012)
6. Dynamic Parallelism in CUDA - Tesla K20 Kepler GPUs, Pre-release of NVIDIA CUDA 5.0
7. NVIDIA Developer Zone - CUDA Downloads, CUDA Toolkit 4.2
8. NVIDIA Developer Zone - GPUDirect
9. OpenACC - NVIDIA
10. Nsight, Eclipse Edition, Pre-release of CUDA 5.0, April 2012
11. NVIDIA OpenCL Programming Guide for the CUDA Architecture, Version 4.0, Feb 2011 (2/14/2011)
12. Optimization : NVIDIA OpenCL Best Practices Guide, Version 1.0, Feb 2011
13. NVIDIA OpenCL JumpStart Guide - Technical Brief
14. NVIDIA CUDA C Best Practices Guide (Design Guide), Version 4.0, May 2011
15. NVIDIA CUDA C Programming Guide, Version 4.0, May 2011 (5/6/2011)
16. NVIDIA GPU Computing SDK
17. Apple : Snow Leopard - OpenCL
18. The OpenCL Specification, Version 1.1, published by the Khronos OpenCL Working Group, Aaftab Munshi (ed.), 2010
19. The OpenCL Specification, Version 1.0, Khronos OpenCL Working Group
20. Khronos V1.0 Introduction and Overview, June 2010
21. The OpenCL 1.1 Quick Reference Card
22. OpenCL 1.1 Specification (Revision 44), June 1, 2011
23. The OpenCL 1.1 Specification (Document Revision 44), Last Revision Date : 6/1/11, Editor : Aaftab Munshi, Khronos OpenCL Working Group
24. OpenCL Reference Pages
25. MATLAB
26. NVIDIA - CUDA MATLAB Acceleration
27. Jason Sanders, Edward Kandrot (Foreword by Jack Dongarra), CUDA by Example - An Introduction to General-Purpose GPU Programming, Addison-Wesley, 2011, NVIDIA
28. David B. Kirk, Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, NVIDIA Corporation, 2010; Elsevier / Morgan Kaufmann Publishers, 2011
29. OpenCL Toolbox for MATLAB
30. NAG
AMD APP - OpenCL Web Sites