hyPACK-2013 Mode-4: GPU Computing with CUDA enabled NVIDIA GPUs
GPUs provide unprecedented computational power for data-intensive applications, and this has attracted wide attention from the scientific community.
The use of the Graphics Processing Unit (GPU) to
accelerate non-graphics computations has drawn much attention.
This is because the computational power of GPUs exceeds that of PC-based CPUs by more
than one order of magnitude while being available at a comparable price.
CUDA 5.0 is used to develop programs in the lab sessions, and tuning and optimisation techniques
are employed to improve the performance of application kernels.
|
CUDA - NVIDIA GPU Programming Overview
NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively
parallel high-performance computing on the company's GPUs.
CUDA technology is a computing architecture that enables the GPU to solve complex
computational problems. It gives computationally intensive applications access to the processing power of NVIDIA graphics
processing units (GPUs) through a new programming interface. The gaming community has long used NVIDIA GPUs and graphics cards
(the GeForce, Quadro and Tesla brands, including Fermi-based products).
CUDA requires programmers to write special code for parallel processing, but it does not require them to manage threads explicitly,
which simplifies the programming model. CUDA includes C/C++ software development tools, function libraries and a hardware abstraction
mechanism that hides the GPU hardware from developers.
Selected scientific and engineering applications that are data intensive as well as embarrassingly
parallel, and consumer-market applications (gaming, video), may require only single-precision floating-point operations.
CUDA provides a solution for such applications, and NVIDIA's newer GPUs, which support double-precision floating-point
operations, can address a broader class of applications. NVIDIA Tesla cards are becoming popular in high-performance
computing applications.
|
CUDA Programming Model
The CUDA programming model manages threads automatically, and it differs significantly from single-threaded CPU code and, to some extent, even from conventional parallel code. Before NVIDIA's CUDA became available, some users in the parallel processing community already wrote code for GPUs. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition the computation into multiple blocks.
The GPU is viewed as a compute device capable of executing a very high number of threads in parallel. It operates as a coprocessor to the main CPU, called the host. Data-parallel, compute-intensive portions of applications running on the host are offloaded to the device as a function (a kernel) that is executed on the device by many different threads. Both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively. One can copy data from one DRAM to the other through optimized API calls that utilize the device's high-performance Direct Memory Access (DMA) engines.
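The partitioning of a computation into blocks of threads can be sketched on the host. The following is a minimal C++ emulation, not device code: the helper names blocks_for and vector_add_emulated are hypothetical, and the two nested loops stand in for the block index and the thread index that a real kernel launch would supply concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed so that blocks * threads_per_block >= n
// (ceiling division), as is usual when sizing a one-dimensional grid.
std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Host-side stand-in for a vector-add kernel (hypothetical helper).
// The outer loop plays the role of the block index and the inner loop
// the thread index within the block; on a device these would run in parallel.
std::vector<float> vector_add_emulated(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t threads_per_block) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> c(n, 0.0f);
    const std::size_t num_blocks = blocks_for(n, threads_per_block);
    for (std::size_t block = 0; block < num_blocks; ++block) {
        for (std::size_t thread = 0; thread < threads_per_block; ++thread) {
            // Global element index derived from block and thread indices.
            const std::size_t i = block * threads_per_block + thread;
            if (i < n) {  // guard: the last block may be only partly used
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

The guard on the global index matters whenever the data size is not a multiple of the block size, since the grid is rounded up to whole blocks.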
Like the GPGPU model, the CUDA model is highly parallel. The approach is to divide the data set into smaller chunks stored in on-chip memory and then allow multiple thread processors to share each chunk. Storing the data locally reduces the need to access off-chip memory, thereby improving performance. In scientific computing, designing applications that avoid off-chip memory accesses may require rewriting the application or redesigning the algorithm. The overhead of loading the required off-chip data into local memory may also affect performance. CUDA handles this so that an off-chip memory access usually does not stall a thread processor: while one thread waits, another thread is ready to execute.
In CUDA, a group of threads works together in round-robin fashion, ensuring that each thread gets execution time without delaying other threads, thereby reducing thread overheads. The time spent waiting for remote memory accesses to be serviced factors strongly into a CUDA program's efficiency and scaling.
A thread block is a batch of threads that can cooperate by efficiently sharing data through fast shared memory and by synchronizing their execution to coordinate memory accesses; synchronization points are specified in the kernel. Each thread is identified by its thread ID, which is the thread number within the block. An application can also specify a block as a three-dimensional array and identify each thread using a 3-component index.
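The relation between a 3-component thread index and the thread ID within a block can be written out as plain arithmetic. The sketch below is host-side C++ under the usual convention that x varies fastest; the Dim3 struct and both function names are hypothetical stand-ins for illustration, not CUDA types.

```cpp
#include <cassert>

// Hypothetical stand-in for a 3-component index or block size.
struct Dim3 {
    unsigned x, y, z;
};

// Thread ID within a block of size (Dx, Dy, Dz) for index (x, y, z):
// id = x + y * Dx + z * Dx * Dy, with x varying fastest.
unsigned thread_id_in_block(Dim3 idx, Dim3 block_dim) {
    assert(idx.x < block_dim.x && idx.y < block_dim.y && idx.z < block_dim.z);
    return idx.x + idx.y * block_dim.x + idx.z * block_dim.x * block_dim.y;
}

// Inverse mapping: recover the 3-component index from the thread ID.
Dim3 thread_index_from_id(unsigned id, Dim3 block_dim) {
    assert(id < block_dim.x * block_dim.y * block_dim.z);
    Dim3 idx;
    idx.x = id % block_dim.x;
    idx.y = (id / block_dim.x) % block_dim.y;
    idx.z = id / (block_dim.x * block_dim.y);
    return idx;
}
```

For a one-dimensional block the two notions coincide; the 3-component form is mainly a convenience when the data is naturally a matrix or a volume.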
The CUDA Toolkit is a complete software development solution for programming CUDA enabled GPUs. The Toolkit includes standard FFT and BLAS libraries, a C-compiler for the NVIDIA GPU and a runtime driver. CUDA technology is currently supported on Linux and Microsoft Windows XP operating systems.
|
CUDA Tool Kit 4.1 for Applications
CUDA Multi-GPU Programming :
In earlier CUDA releases, two basic approaches were available to execute CUDA kernels on multiple GPUs (CUDA "devices") concurrently from a single host application:
-
Use one host thread per device, since any given host thread could call cudaSetDevice() at most one time.
- Use the push/pop context functions provided by the CUDA Driver API.
For applications that require tight coupling of the various CUDA devices within a system, these approaches may
not be sufficient because the devices need to synchronize or communicate with each other. The CUDA Runtime now lets
a single host thread launch work onto any device it needs. To accomplish this, a host
thread can call cudaSetDevice() at any time to change the currently active device. A host thread can therefore control more than one device.
The CUDA Driver API (version 4.1) provides a way to access multiple devices from within a single host thread,
namely cuCtxPushCurrent() and cuCtxPopCurrent(). For convenience, CUDA application developers can also use a
set/get context management interface: cuCtxSetCurrent() and cuCtxGetCurrent() are provided in version 4.1 of the
CUDA Driver API in addition to the existing cuCtxPushCurrent() and cuCtxPopCurrent() functions.
Programming a multi-GPU application is no different from programming an application to
utilize multiple cores or sockets, because CUDA is completely orthogonal to CPU thread management
and message-passing APIs. The main additional task is selecting the correct GPU, which in most cases is a
free GPU (one without a context). The compute-intensive portions of existing multi-threaded CPU code can be identified and ported to the GPU without changing the inter-CPU-thread communication code.
In order to issue work to a GPU, a context is established between a CPU thread (or group of threads) and the GPU. Only one context can be active on a GPU at any particular instant. Similarly, a CPU thread can have one active context at a time. A context is established during the program's first call to a function that changes state (such as cudaMalloc()), so one can force the creation of a context by calling cudaFree(0). Note that a context is created on GPU 0 by default, unless another GPU is selected explicitly prior to context creation with a cudaSetDevice() call. The context is destroyed either with a cudaDeviceReset() call or when the controlling CPU process exits.
MPI, OpenMP, Pthreads on Host CPU (Multi-Core) & Multi-GPU :
In order to issue work to p GPUs concurrently, a program can either use p CPU threads, each with its own
context, or it can use one CPU thread that swaps among several contexts, or some combination thereof.
CPU threads can be lightweight (pthreads,
OpenMP, etc.) or heavyweight (MPI). Note that any CPU multi-threading or message-passing API or library
can be used, as CPU thread management is completely orthogonal to CUDA. For example, one can add GPU
processing to an existing MPI application by porting the compute-intensive portions of the code without
changing the communication structure. To synchronize computations across GPUs, communication must go through
the host CPU or GPUDirect.
Even though a GPU can execute calls from one context at a time, it can belong to multiple contexts. For example,
it is possible for several CPU threads to establish separate contexts with the same GPU (though multiple
CPU threads within the same process accessing the same GPU would normally share the same context by default).
The GPU driver manages GPU switching between the contexts, as well as partitioning memory among the contexts
(GPU memory allocated in one context cannot be accessed from another context).
In many applications, the algorithm is designed so that each CPU thread or process (Pthreads, OpenMP, MPI)
controls a different GPU. Achieving this is straightforward if a program spawns as many lightweight threads as there are
GPUs: one can derive the GPU index from the thread ID.
For example, the OpenMP thread ID can be readily used to select a GPU.
The MPI rank can be used to choose a GPU reliably as long as all MPI processes are launched on a single host node
that has GPU devices and a configured CUDA programming environment.
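Deriving a GPU index from a thread ID or an MPI rank is a one-line computation. The sketch below is host-side C++ with a hypothetical helper name, device_for_worker; it assumes all workers run on a single node, and it wraps around (an assumption beyond the text above) when there are more workers than devices. In a CUDA program the device count would come from cudaGetDeviceCount() and the result would be passed to cudaSetDevice().

```cpp
#include <cassert>

// Hypothetical helper: map a host worker (an OpenMP thread ID, or an MPI
// rank when all ranks run on one node) to a device index. Workers are
// assigned round-robin if there are more workers than devices.
int device_for_worker(int worker_id, int device_count) {
    assert(worker_id >= 0 && device_count > 0);
    return worker_id % device_count;
}
```

With exactly as many workers as devices the mapping is the identity, which matches the simple case described above.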
|
Unified Virtual Addressing and GPUDirect 2.0 :
CUDA Toolkit 5.0 eases programming in multi-GPU environments for NVIDIA Tesla T20-series (Fermi & Kepler)
GPUs running in 64-bit mode on Linux.
Unified Virtual Addressing (UVA) allows the system memory and the one or more device memories
in a system to share a single virtual address space. This allows the CUDA Driver to determine
the physical memory space to which a particular pointer refers by inspection, which simplifies
the APIs of functions such as cudaMemcpy(), since the application need no longer keep track
of which pointers refer to which memory.
GPUDirect 2.0
Built on top of UVA, GPUDirect v2.0 provides for direct peer-to-peer communication among the multiple devices in a system and for native MPI transfers directly from device memory.
Multi-Threaded Programming :
Multiple host threads can now make the same context current, which has several important ramifications for multi-threaded processes. Some of these are given below.
For more detail, refer to CUDA Toolkit 5.0 for Applications.
-
Host threads can now share device memory allocations, streams, events, or any other per-context objects (as seen above).
-
Concurrent kernel execution on devices of compute capability 2.x is now possible across host threads, rather than just within a single host thread. Note that this requires the use of separate streams; unless streams are specified, the kernels will be executed sequentially on the device in the order they were launched. In all cases, kernel launch via the
<<<>>>
notation is a thread-safe operation.
-
cudaGetLastError() is per-host-thread: it returns the last error returned by an API call in that host thread, even if other host threads are concurrently accessing the same device.
|
CUDA Driver API :
Version 4.1 includes a feature that allows multiple host threads to set a particular context current simultaneously, using either cuCtxSetCurrent() or cuCtxPushCurrent().
For more information, refer to CUDA Toolkit 5.0 for Applications.
|
NVIDIA CUDA TOOLKIT Libraries
-
The CUBLAS library now supports a new API that is thread-safe and allows the application to more easily take advantage of parallelism using streams, especially for functions with scalar return parameters. This new API allows CUBLAS to work cleanly with applications using the new multi-threading features of CUDA Runtime 4.1. The legacy CUBLAS API is still supported, but it is not thread-safe and does not offer as many opportunities for parallelism with streams as the new API.
-
The CURAND library now supports double precision Sobol, scrambled Sobol, log-normal distributions, and a faster setup technique for XORWOW.
-
The CUFFT and CUBLAS library APIs now include functions that will report the library's version number.
-
The CUSPARSE library now provides a solver for triangular sparse linear systems via the cusparse*csrsv_analysis() and cusparse*csrsv_solve() API functions.
-
The Thrust template library and the NPP image processing library are now bundled with the CUDA Toolkit, with no additional download required.
-
Some API functions in the NPP library were changed to pass results via device pointer instead of via host pointer for consistency with all of the rest of the NPP API.
|
CUDA OpenACC
-
OpenACC is a new open parallel programming standard designed to enable the millions of scientific
and technical Fortran and C programmers to easily take advantage of the transformative power of
heterogeneous CPU/GPU computing systems.
OpenACC provides compiler directives that identify the areas of code to accelerate, without
requiring programmers to modify or adapt the underlying code itself. The directives expose the high degree of data
parallelism present in the code to the compiler and allow the
compiler to do the detailed work of mapping the computation onto the accelerator.
Directives provide a common code base that is multi-platform and multi-vendor compatible,
offering a way to preserve investment in legacy applications by enabling an easy
migration path to accelerated computing. NVIDIA is positioning OpenACC as a kind of
high-level gateway to its lower-level CUDA GPU programming language.
With GPU directives based on the OpenACC standard, users
can accelerate their code by inserting compiler hints into it, and the compiler
automatically maps the compute-intensive portions of the code to the GPU.
The OpenACC Application Program Interface (OpenACC API), which provides portability across
operating systems, host CPUs and GPU accelerators, and the CUDA APIs will be used in
this Coding Competition.
PGI Accelerator compilers with PGI directive-based programming (OpenACC directives) and
NVIDIA CUDA GPUs (CUDA SDK/APIs; CUDA tuning and performance; CUDA Toolkit) are used as the computing
platform in the hyPACK-2013 workshop.
The OpenACC API describes a collection of compiler directives
to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from
a host CPU to an attached accelerator, providing portability across operating systems,
host CPUs and accelerators.
For free trial programs from NVIDIA, PGI and others, users can try
NVIDIA's "Accelerate Your Scientific Code with OpenACC" trial, the Portland Group PGI Accelerator Fortran and C99 trial software,
or the CAPS Enterprise HMPP Workbench trial software. For more details, visit
NVIDIA OpenACC
PGI :
The Portland Group (PGI) (http://www.pgroup.com), the
leading independent supplier of compilers and tools for high-performance computing,
has made available the initial release of its PGI Accelerator Fortran and C compilers with
support for the new OpenACC specification for directive-based programming of GPUs and accelerators.
A trial version of the beta release of the PGI Accelerator compilers with
support for the OpenACC standard is available for free from the PGI website
at http://www.pgroup.com/support/downloads.php.
The beta software includes a restricted-use license. The license agreement is available
at http://www.pgroup.com/support/BTLA
More information on the PGI Accelerator compilers with OpenACC support is
available at http://www.pgroup.com/accelerate .
More information on the OpenACC API and standard can be found at
http://www.openacc.org
Execution Model : The execution model targeted by OpenACC API-enabled compilers is host-directed
execution with an attached accelerator device, such as a GPU. The bulk of a user
application executes on the host. Compute intensive regions are offloaded to the
accelerator device under control of the host. The device executes parallel regions,
which typically contain work-sharing loops, or kernels regions, which typically
contain one or more loops that are executed as kernels. Even in accelerator-targeted
regions, the host must orchestrate the execution by allocating memory on the
accelerator device, initiating data transfer, sending the code to the accelerator,
passing arguments to the parallel region, queuing the device code, waiting for completion,
transferring results back to the host, and deallocating memory.
Memory Model : The most significant difference between a host-only program and a host+accelerator
program is that the memory on the accelerator may be completely separate from host
memory. This is the case with most current GPUs, for example. In this case, the host
may not be able to read or write device memory directly because it is not mapped into
the host's virtual memory space. All data movement between host memory and device memory
must be performed by the host through runtime library calls that explicitly move data
between the separate memories, typically using direct memory access (DMA) transfers.
Programmers must be aware of the available memory bandwidth, which affects compute intensity,
the limited device memory, and the cache available for read-only data.
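The explicit data movement described in the memory model can be expressed with an OpenACC data region. The following is a minimal sketch, not the workshop's own code: the function name saxpy_acc is hypothetical, and the example assumes an OpenACC-enabled compiler for offloading. A compiler without OpenACC support ignores the pragmas and runs the loop serially on the host with the same result.

```cpp
#include <cassert>
#include <vector>

// Computes y <- alpha * x + y (hypothetical example function).
void saxpy_acc(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();
    // Data region: x is copied to the device on entry (copyin);
    // y is copied to the device on entry and back to the host on exit (copy).
    #pragma acc data copyin(xp[0:n]) copy(yp[0:n])
    {
        // The loop iterations are distributed across the accelerator.
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            yp[i] = alpha * xp[i] + yp[i];
        }
    }
}
```

Keeping the data clauses on an enclosing data region, rather than on each compute construct, is a common way to avoid repeated transfers when several loops reuse the same arrays.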
|
An Overview of OpenACC Directives
-
Directives facilitate code development for accelerators
-
Provide functionality to initiate accelerator startup and shutdown; manage data or program transfers
between host (CPU) and accelerator; manage work between host (CPU) and accelerator; map
required computations onto accelerators
|
Some of the Categories of OpenACC APIs are :
- Accelerator Parallel Region / Kernels Directives
- Loop Directives
- Data Declaration Directives
- Data Region Directives
- Cache Directives
- Runtime Library Routines
- Environment variables
|
C/C++ :
#pragma acc directive-name [clause [, clause]...] new-line
Fortran :
!$acc directive-name [clause [, clause]...]
c$acc directive-name [clause [, clause]...]
*$acc directive-name [clause [, clause]...]
OpenACC Parallel Directive
#pragma acc parallel [clause [, clause]...] new-line
structured block
|
The kernels directive defines a region of a program that is to be compiled into
a sequence of kernels for execution on the accelerator. Most importantly, each loop
nest will be a different kernel, and the kernels are launched in order on the device.
When a parallel directive is executed, gangs of worker threads are created to
execute the region on the accelerator. One worker in each gang begins executing the code in
the structured block, and the number of gangs and workers remains constant for the duration
of the parallel region.
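A kernels region with two loop nests illustrates the "one kernel per loop nest, launched in order" behaviour. This is a hedged sketch with a hypothetical function name, fill_and_double; whether and how each loop is parallelised is left to the OpenACC compiler, and a compiler without OpenACC support ignores the pragma and runs both loops on the host.

```cpp
#include <cassert>
#include <vector>

// Fills a[i] = i and then b[i] = 2 * a[i] inside one kernels region
// (hypothetical example function).
void fill_and_double(std::vector<float>& a, std::vector<float>& b) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    float* ap = a.data();
    float* bp = b.data();
    // Both arrays are produced on the device and copied back on exit.
    #pragma acc kernels copyout(ap[0:n], bp[0:n])
    {
        // First loop nest: typically compiled into its own kernel.
        for (int i = 0; i < n; ++i) {
            ap[i] = static_cast<float>(i);
        }
        // Second loop nest: a separate kernel, launched after the first,
        // so it can safely read the values written above.
        for (int i = 0; i < n; ++i) {
            bp[i] = 2.0f * ap[i];
        }
    }
}
```

The second loop depends on the first, which is why the in-order launch of the generated kernels matters here.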
|
OpenCL - CUDA Enabled NVIDIA GPU
Architecture : The CUDA architecture is close to the OpenCL architecture. A CUDA device is built around a
scalable array of multithreaded Streaming Multiprocessors (SMs).
A multiprocessor corresponds to an OpenCL compute unit.
A multiprocessor executes a CUDA thread for each OpenCL work-item and a thread block for
each OpenCL work-group. A kernel is executed over an OpenCL NDRange by a grid of thread blocks.
Each of the thread blocks that execute a kernel is therefore uniquely identified by its work-group ID,
and each thread by its global ID or by a combination of its local ID and work-group ID.
A thread is also given a unique thread ID within its block. When an OpenCL program on the host
invokes a kernel, the work-groups are enumerated and distributed as thread blocks to the multiprocessors
with available execution capacity. The threads of a thread block execute concurrently on one
multiprocessor. As thread blocks terminate, new blocks are launched on the vacated
multiprocessors.
Memory Model : Each multiprocessor of the NVIDIA CUDA architecture has on-chip memory of the following four types:
-
One set of local 32-bit registers per processor.
-
A parallel data cache, or shared memory, that is shared by all scalar processor cores and is where OpenCL local memory resides.
-
A read-only constant cache that is shared by all scalar processor cores and speeds up reads from OpenCL constant memory.
-
A read-only texture cache that is shared by all scalar processor cores and speeds up reads from OpenCL image objects. Each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering specified by OpenCL sampler objects; the region of device memory addressed by images is referred to as texture memory.
There is also a global memory address space that is used for OpenCL global memory and a local memory address space that is private to each thread (and should not be confused with OpenCL local memory). Both memory spaces are read-write regions of device memory and are not cached.
|
List of Programs - OpenCL - CUDA enabled NVIDIA GPUs :
-
CUDA Kernels, Thread Hierarchy, Device Memory, Advantage of Shared Memory, page-locked & pinned
memory, Asynchronous Concurrent Execution, Overlap of Data Transfer and Kernel Execution,
Dynamic partitioning of Shared memory resources, Kernel function Divergence; Global Memory Bandwidth (Memory access pattern for coalescing).
-
CUDA Streams, Multi-Device System, Warp level Parallelism - CUDA, Data
Prefetching -
Basic Codes : Numerical integration of a function f(x) = 4/(1+x^2) between the
limits 0 and 1; Prefix sum of a given array
-
Introduction to NVIDIA-PGI Compiler Directives - OpenACC
on GPUs; CUDA enabled NVIDIA GPUs
-
Performance of Matrix Computations - NVIDIA-PGI Compiler Directives
OpenACC on GPUs; CUDA enabled NVIDIA GPUs
-
Performance of Application Kernels - NVIDIA-PGI Compiler Directives
OpenACC on GPUs; CUDA enabled NVIDIA GPUs
-
Example programs based on The OpenACC Application Program Interface (a collection
of compiler directives and the details are implicit in the programming model and
are managed by the OpenACC API-enabled compilers and runtimes) for matrix
computations on NVIDIA GPUs.
-
Simple example programs on Multi-Core Processors with NVIDIA - GPU Computing CUDA 4.1 SDK.
-
Special example programs using CUDA Tool Chain on Multi-Core
Processors with NVIDIA - GPU Computing CUDA SDK (CULA Tools, CUBLAS, CUFFT, CUSPARSE)
-
Special example programs on matrix computations using Concurrent Asynchronous Execution
APIs of CUDA 4.1 enabled NVIDIA GPUs (single/Multiple devices).
-
Special example programs based on Streams (Concurrent Asynchronous Execution) of CUDA 4.1 of NVIDIA GPU
-
LLVM-based CUDA compiler and toolkit technologies
for matrix computation and application kernels;
GPU Accelerator Programming Model - Compiler Optimizations
-
Exposure to the NVIDIA Parallel Nsight toolkit.
-
Codes to understand different memory types of
CUDA enabled NVIDIA GPUs for matrix computations.
-
Example programs based on Numerical Linear Algebra using CUDA enabled NVIDIA
GPUs and OpenCL.
-
Example programs (BLAS, FFTs) based on CUDA BLAS Libraries
-
Example programs based on special classes of problems (Dense & Sparse Matrix Computations,
Fast Search Algorithms, & Partial Differential Equations (PDEs)) will be discussed using
CUDA enabled NVIDIA GPUs
-
Code Walk through and execution of parallel programs based on mixed
programming environment using TBB, Pthreads, OpenMP on host Multi-Core systems with
GPU Accelerator devices.
-
Selective example programs on numerical and non-numerical computations
using NVIDIA - GPU Computing CUDA SDK and OpenCL.
-
Example programs based on CUDA APIs
to completely overlap CPU and GPU execution and I/O in HPC GPU Cluster environment.
-
Performance of memory (pinned/locked) & CUDA shared memory usage on CUDA enabled GPUs for application
kernels.
-
Develop test suites to launch multiple kernels on CUDA enabled NVIDIA single & multiple GPU devices.
-
Tuning & Performance using CUDA enabled NVIDIA GPU Libraries; Memory Optimisation, Data-access optimization
for matrix computations
-
Demonstration of Integrated Numerical Linear Algebra Kernels for Matrix Computations
(using Open Source Software)
on CUDA enabled NVIDIA GPUs & OpenCL.
-
Example programs on Heterogeneous Programming - OpenCL based on CUDA
enabled NVIDIA GPUs.
-
Implementation of Image Processing applications (Edge Detection, Face Detection &
Image Inpainting algorithms) on GPGPUs using
CUDA/OpenCL enabled NVIDIA GPUs on an HPC GPU Cluster
-
Implementation of String Search Algorithms on
CUDA/OpenCL enabled NVIDIA GPUs on an HPC GPU Cluster
-
Tiled matrix-matrix multiplication, Numerical Linear
Algebra - CUDA; CUDA BLAS Libraries, CUDA SDKs, Implementation of Partial Differential Equations,
Image Processing - Edge Detection Algorithms; String Search Algorithms
-
Example programs that take advantage of shared memory features of CUDA enabled NVIDIA GPUs
for Dense Matrix computations
-
Example programs that take advantage of CUDA Streams for Multi-GPU implementation of
Dense matrix computation Kernels
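The numerical integration of f(x) = 4/(1+x^2) between 0 and 1, listed among the basic codes above, approximates pi and decomposes naturally into per-thread partial sums followed by a reduction. The sketch below is a host-side C++ emulation using the midpoint rule; the function name integrate_pi and the strided assignment of intervals to emulated threads are assumptions made for illustration, not the workshop's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Midpoint-rule estimate of the integral of 4/(1+x^2) over [0,1],
// whose exact value is pi (hypothetical example function).
double integrate_pi(std::size_t intervals, std::size_t num_threads) {
    assert(intervals > 0 && num_threads > 0);
    const double h = 1.0 / static_cast<double>(intervals);
    std::vector<double> partial(num_threads, 0.0);
    // Each emulated thread t handles intervals t, t + num_threads, ...
    // On a device these outer iterations would run concurrently.
    for (std::size_t t = 0; t < num_threads; ++t) {
        for (std::size_t i = t; i < intervals; i += num_threads) {
            const double x = (static_cast<double>(i) + 0.5) * h;
            partial[t] += 4.0 / (1.0 + x * x);
        }
    }
    // Final reduction of the per-thread partial sums.
    double sum = 0.0;
    for (double p : partial) {
        sum += p;
    }
    return sum * h;
}
```

For example, integrate_pi(100000, 64) returns a value that agrees with pi to well beyond five decimal places; on a GPU the final reduction would usually be done in shared memory within each block.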
The matrix multiplication examples illustrate the typical data-parallel approach used by OpenCL applications
to achieve good performance on GPUs. They illustrate the use of OpenCL local memory, which maps to shared memory
on the CUDA architecture. Shared memory is much faster than global memory, and implementations based on
shared memory accesses improve performance for typical matrix computations.
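The tiling idea behind the shared-memory matrix multiplication can be sketched on the host. In the C++ emulation below, the small a_tile and b_tile buffers stand in for shared (OpenCL local) memory, and each (bi, bj) pair plays the role of one thread block; the function name matmul_tiled and the row-major layout are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for n x n row-major matrices, processed tile by tile
// (hypothetical example function).
std::vector<float> matmul_tiled(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t n, std::size_t tile) {
    assert(tile > 0 && A.size() == n * n && B.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    // Stand-ins for the per-block shared memory tiles.
    std::vector<float> a_tile(tile * tile), b_tile(tile * tile);
    for (std::size_t bi = 0; bi < n; bi += tile) {          // block row
        for (std::size_t bj = 0; bj < n; bj += tile) {      // block column
            for (std::size_t bk = 0; bk < n; bk += tile) {  // tile along k
                // Stage one tile of A and one of B, zero-padding past the edge.
                for (std::size_t i = 0; i < tile; ++i) {
                    for (std::size_t j = 0; j < tile; ++j) {
                        const std::size_t ar = bi + i, ac = bk + j;
                        const std::size_t br = bk + i, bc = bj + j;
                        a_tile[i * tile + j] = (ar < n && ac < n) ? A[ar * n + ac] : 0.0f;
                        b_tile[i * tile + j] = (br < n && bc < n) ? B[br * n + bc] : 0.0f;
                    }
                }
                // Multiply the staged tiles and accumulate into C.
                for (std::size_t i = 0; i < tile && bi + i < n; ++i) {
                    for (std::size_t j = 0; j < tile && bj + j < n; ++j) {
                        float acc = 0.0f;
                        for (std::size_t k = 0; k < tile; ++k) {
                            acc += a_tile[i * tile + k] * b_tile[k * tile + j];
                        }
                        C[(bi + i) * n + (bj + j)] += acc;
                    }
                }
            }
        }
    }
    return C;
}
```

Each staged element is reused by a whole row or column of the tile, which is where the reduction in global memory traffic comes from on a real device.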
Experts may discuss performance guidelines, focusing on Instruction Performance, Memory Bandwidth Issues,
Shared Memory, NDRange & execution time of a kernel launch on the OpenCL implementation, Data transfer
between Host and Device, Warp level synchronization issues, and overall performance optimization strategies.
|