



hyPACK-2013 Mode-4: GPU Computing with CUDA enabled NVIDIA GPUs

The computational power of GPUs has attracted wide attention in the scientific community, since GPUs offer unprecedented performance for data-intensive applications. The use of the Graphics Processing Unit (GPU) to accelerate non-graphics computations has drawn much attention because the computational power of GPUs has exceeded that of PC-based CPUs by more than an order of magnitude while remaining available at a comparable price. CUDA 5.0 is used for program development in the laboratory sessions, and tuning and optimisation techniques are employed to extract the performance of application kernels.




CUDA - NVIDIA GPU Programming Overview

NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on the company's powerful GPUs. CUDA is a fundamentally new computing architecture that enables the GPU to solve complex computational problems.

CUDA technology gives computationally intensive applications access to the processing power of NVIDIA graphics processing units (GPUs) through a new programming interface. The game and graphics community has been using NVIDIA's GPUs and graphics cards (the GeForce, Quadro, Tesla and Fermi brand products) for a long time.

CUDA requires programmers to write special code for parallel processing, but it does not require them to explicitly manage threads, which simplifies the programming model. CUDA includes C/C++ software development tools, function libraries and a hardware abstraction mechanism that hides the GPU hardware from developers. Selected scientific and engineering applications that are data-intensive or embarrassingly parallel, as well as consumer-market applications (gaming, video), may require only single-precision floating-point operations; CUDA provides a solution for such applications, and NVIDIA's newer GPUs, which support double-precision floating-point operations, can address a broader class of applications. The NVIDIA Tesla cards are becoming popular in high-performance computing applications.



CUDA Programming Model

The CUDA programming model manages threads automatically, and it differs significantly from single-threaded CPU code and, to some extent, even from other parallel code. Before NVIDIA's CUDA became available, some users in the parallel-processing community wrote codes for GPUs through graphics APIs. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks. The GPU is viewed as a compute device capable of executing a very high number of threads in parallel. It operates as a coprocessor to the main CPU, called the host. Data-parallel, compute-intensive portions of applications running on the host are off-loaded to the device as a function (kernel) that is executed on the device by many different threads. Both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively. One can copy data from one DRAM to the other through optimized API calls that utilize the device's high-performance Direct Memory Access (DMA) engines.
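The host/device flow described above can be illustrated with a minimal sketch; the kernel, array size and launch configuration below are illustrative assumptions, not part of the workshop codes:

     #include <cuda_runtime.h>
     #include <stdio.h>

     /* Kernel executed on the device by many threads: each thread scales one element. */
     __global__ void scale(float *d_a, float alpha, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
         if (i < n)
             d_a[i] *= alpha;
     }

     int main(void)
     {
         const int n = 1 << 20;
         size_t bytes = n * sizeof(float);
         float *h_a = (float *)malloc(bytes);
         for (int i = 0; i < n; i++) h_a[i] = 1.0f;

         float *d_a;
         cudaMalloc((void **)&d_a, bytes);                     /* allocate device memory  */
         cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  /* host -> device via DMA  */

         scale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);        /* launch kernel on device */

         cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  /* device -> host          */
         printf("h_a[0] = %f\n", h_a[0]);

         cudaFree(d_a);
         free(h_a);
         return 0;
     }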

The CUDA model is a highly parallel GPGPU model. The approach is to divide the data set into smaller chunks stored in on-chip memory and then allow multiple thread processors to share each chunk. Storing the data locally reduces the need to access off-chip memory, thereby improving performance. Designing scientific-computing applications that avoid off-chip memory access may require rewriting the application or redesigning the algorithm. Also, the overheads involved in loading the required off-chip data into local memory may affect performance. CUDA handles this intelligently: an off-chip memory access usually does not stall a thread processor, because another thread is ready to execute in its place.

In CUDA, a group of threads works together in round-robin fashion, ensuring that each thread gets execution time without delaying other threads, thereby reducing thread-scheduling overheads. How well the wait for remote access is hidden strongly factors into a CUDA program's efficiency and scaling. A thread block is a batch of threads that can cooperate by efficiently sharing data through fast shared memory and by synchronizing their execution to coordinate memory accesses, using synchronization points specified in the kernel. Each thread is identified by its thread ID, which is the thread number within the block. An application can also specify a block as a three-dimensional array and identify each thread using a 3-component index.
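A minimal sketch of block-level cooperation (the kernel name and the block size of 256 are illustrative assumptions): each thread of a block stages one element into shared memory, the block synchronizes, and each thread then reads a neighbouring element from the on-chip tile.

     __global__ void shift_left(const float *in, float *out, int n)
     {
         __shared__ float tile[256];              /* fast on-chip shared memory; assumes blockDim.x == 256 */
         int i = blockIdx.x * blockDim.x + threadIdx.x;

         if (i < n)
             tile[threadIdx.x] = in[i];           /* each thread loads one element   */
         __syncthreads();                         /* synchronization point in kernel */

         /* read a neighbour's element from shared memory (stay inside the block) */
         if (i < n) {
             int j = (threadIdx.x + 1 < blockDim.x) ? threadIdx.x + 1 : threadIdx.x;
             out[i] = tile[j];
         }
     }

Such a kernel would be launched as shift_left<<<(n + 255) / 256, 256>>>(d_in, d_out, n), so that the shared tile matches the block size.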

The CUDA Toolkit is a complete software development solution for programming CUDA enabled GPUs. The Toolkit includes standard FFT and BLAS libraries, a C-compiler for the NVIDIA GPU and a runtime driver. CUDA technology is currently supported on Linux and Microsoft Windows XP operating systems.



CUDA Toolkit 4.1 for Applications

CUDA Multi-GPU Programming : The CUDA programming model provides two basic approaches for executing CUDA kernels on multiple GPUs (CUDA "devices") concurrently from a single host application:

  • Use one host thread per device, since any given host thread can call cudaSetDevice() at most one time.
  • Use the push/pop context functions provided by the CUDA Driver API.

For applications that require tight coupling among the various CUDA devices within a system, these approaches may not be sufficient, because the devices must synchronize or communicate with each other. The CUDA Runtime now provides features by which a single host thread can easily launch work onto any device it needs. To accomplish this, a host thread can call cudaSetDevice() at any time to change the currently active device, so a host thread can now control more than one device. The CUDA Driver API (version 4.1) provides a way to access multiple devices from within a single host thread, namely cuCtxPushCurrent() and cuCtxPopCurrent(). For convenience, CUDA application developers can also use a set/get context-management interface: cuCtxSetCurrent() and cuCtxGetCurrent() have been added in version 4.1 of the CUDA Driver API in addition to the existing cuCtxPushCurrent() and cuCtxPopCurrent() functions.

Programming a multi-GPU application is not very different from programming an application to utilize multiple cores or sockets, because CUDA is completely orthogonal to CPU thread management and message-passing APIs. What matters most is selecting the correct GPU, which in most cases is a free GPU (one without a context). It is also straightforward to identify the compute-intensive portion of an existing multi-threaded CPU code and port it to the GPU while leaving the inter-CPU-thread communication code unchanged.

In order to issue work to a GPU, a context is established between a CPU thread (or group of threads) and the GPU. Only one context can be active on a GPU at any particular instant. Similarly, a CPU thread can have one active context at a time. A context is established during the program's first call to a function that changes state (such as cudaMalloc(), etc.), so one can force the creation of a context by calling cudaFree(0). Note that a context is created on GPU 0 by default, unless another GPU is selected explicitly prior to context creation with a cudaSetDevice() call. The context is destroyed either with a cudaDeviceReset() call or when the controlling CPU process exits.
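A minimal sketch of the device-selection and context-creation sequence described above (the device index and the surrounding error handling are illustrative assumptions):

     int dev = 1;                    /* any valid device index; GPU 0 is the default       */
     cudaSetDevice(dev);             /* select the device before the context exists        */
     cudaFree(0);                    /* first state-changing call: forces context creation */
     /* ... cudaMalloc()/kernel launches now run on device 'dev' ...                       */
     cudaDeviceReset();              /* destroys the context on the current device         */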

MPI, OpenMP, Pthreads on Host CPU (Multi-Core) & Multi-GPU : In order to issue work to p GPUs concurrently, a program can either use p CPU threads, each with its own context, or it can use one CPU thread that swaps among several contexts, or some combination thereof. CPU threads can be lightweight (pthreads, OpenMP, etc.) or heavyweight (MPI). Note that any CPU multi-threading or message-passing API or library can be used, as CPU thread management is completely orthogonal to CUDA. For example, one can add GPU processing to an existing MPI application by porting the compute-intensive portions of the code without changing the communication structure. For synchronization across computations on different GPUs, communication through the host CPU or through GPUDirect is required.

Even though a GPU can execute calls from one context at a time, it can belong to multiple contexts. For example, it is possible for several CPU threads to establish separate contexts with the same GPU (though multiple CPU threads within the same process accessing the same GPU would normally share the same context by default). The GPU driver manages GPU switching between the contexts, as well as partitioning memory among the contexts (GPU memory allocated in one context cannot be accessed from another context).

In many applications, the algorithm is designed so that each CPU thread (Pthreads, OpenMP, MPI process) controls a different GPU. Achieving this is straightforward if a program spawns as many lightweight threads as there are GPUs: one can derive the GPU index from the thread ID. For example, the OpenMP thread ID can be used directly to select a GPU. The MPI rank can be used to choose a GPU reliably as long as all MPI processes are launched on a single host node that has the GPU devices and a host configuration of the CUDA programming environment.
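As an illustration of the thread-ID-to-GPU mapping, the following minimal OpenMP sketch gives each OpenMP thread its own GPU; the kernel, buffer size and launch configuration are illustrative assumptions:

     #include <omp.h>
     #include <cuda_runtime.h>

     __global__ void work(float *d_buf)           /* trivial placeholder kernel        */
     {
         d_buf[threadIdx.x] = (float)threadIdx.x;
     }

     void run_on_all_gpus(void)
     {
         int ngpus = 0;
         cudaGetDeviceCount(&ngpus);

         #pragma omp parallel num_threads(ngpus)
         {
             int tid = omp_get_thread_num();
             cudaSetDevice(tid);                  /* OpenMP thread ID selects the GPU  */

             float *d_buf;
             cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
             work<<<4, 256>>>(d_buf);             /* each thread drives its own device */
             cudaDeviceSynchronize();
             cudaFree(d_buf);
         }
     }

Compiled with nvcc and OpenMP enabled (e.g. -Xcompiler -fopenmp), each thread establishes its own context on the device selected by its thread ID.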




Unified Virtual Addressing and GPUDirect 2.0 : CUDA Toolkit 5.0 eases programming in multi-GPU environments for NVIDIA Tesla T20-series (Fermi & Kepler) GPUs running in 64-bit mode on Linux. Unified Virtual Addressing (UVA) allows the system memory and the one or more device memories in a system to share a single virtual address space. This allows the CUDA Driver to determine, by inspection, the physical memory space to which a particular pointer refers, which simplifies the APIs of functions such as cudaMemcpy(), since the application no longer needs to keep track of which pointers refer to which memory.
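With UVA in effect, the copy direction can be left to the driver. A minimal sketch (the buffer size is an illustrative assumption):

     float *h_buf = NULL, *d_buf = NULL;
     size_t bytes = 1024 * sizeof(float);

     cudaMallocHost((void **)&h_buf, bytes);        /* pinned host memory, in the UVA space */
     cudaMalloc((void **)&d_buf, bytes);

     /* cudaMemcpyDefault: the driver infers the direction from the pointers */
     cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);
     cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDefault);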


GPUDirect 2.0 : Built on top of UVA, GPUDirect v2.0 provides direct peer-to-peer communication among the multiple devices in a system and native MPI transfers directly from device memory.
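A minimal sketch of enabling peer-to-peer access between two devices, assuming d_buf0 and d_buf1 were previously allocated with cudaMalloc() on devices 0 and 1 respectively and each holds 'bytes' bytes:

     int canAccess = 0;
     cudaDeviceCanAccessPeer(&canAccess, 0, 1);     /* can device 0 access device 1?   */

     if (canAccess) {
         cudaSetDevice(0);
         cudaDeviceEnablePeerAccess(1, 0);          /* enable P2P from device 0 to 1   */
     }

     /* direct device-to-device copy; with UVA, cudaMemcpy(..., cudaMemcpyDefault)
        on the two device pointers would also work */
     cudaMemcpyPeer(d_buf0, 0, d_buf1, 1, bytes);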

Multi-Threaded Programming : The multi-device features above have several important ramifications for multi-threaded processes; some of these are given below. For more detail, refer to CUDA Toolkit 5.0 for Applications.

  • Host threads can now share device memory allocations, streams, events, or any other per-context objects (as seen above).

  • Concurrent kernel execution on devices of compute capability 2.x is now possible across host threads, rather than just within a single host thread. Note that this requires the use of separate streams; unless streams are specified, the kernels will be executed sequentially on the device in the order they were launched. In all cases, kernel launch via the <<<>>> notation is a thread-safe operation.

  • cudaGetLastError() is per-host-thread: it returns the last error returned by an API call in that host thread, even if other host threads are concurrently accessing the same device.


CUDA Driver API :

Version 4.1 adds a feature by which multiple host threads can set a particular context current simultaneously, using either cuCtxSetCurrent() or cuCtxPushCurrent(). For more information, refer to CUDA Toolkit 5.0 for Applications. This has several important ramifications for multi-threaded processes:

  • Host threads can now share device memory allocations, streams, events, or any other per-context objects (as seen above).

  • Concurrent kernel execution on devices of compute capability 2.x is now possible across host threads, rather than just within a single host thread. Note that this requires the use of separate streams; unless streams are specified, the kernels will be executed sequentially on the device in the order they were launched.




NVIDIA CUDA TOOLKIT Libraries

  • The CUBLAS library now supports a new API that is thread-safe and allows the application to more easily take advantage of parallelism using streams, especially for functions with scalar return parameters (a minimal sketch of the handle-based API is given after this list). This new API allows CUBLAS to work cleanly with applications using the new multi-threading features of CUDA Runtime 4.1. The legacy CUBLAS API is still supported, but it is not thread-safe and does not offer as many opportunities for parallelism with streams as the new API.

  • The CURAND library now supports double precision Sobol, scrambled Sobol, log-normal distributions, and a faster setup technique for XORWOW.

  • The CUFFT and CUBLAS library APIs now include functions that will report the library's version number.

  • The CUSPARSE library now provides a solver for triangular sparse linear systems via the cusparse*csrsv_analysis() and cusparse*csrsv_solve() API functions.

  • The Thrust template library and the NPP image processing library are now bundled with the CUDA Toolkit, with no additional download required.

  • Some API functions in the NPP library were changed to pass results via device pointer instead of via host pointer for consistency with all of the rest of the NPP API.
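A minimal sketch of the handle-based (v2) CUBLAS API mentioned above, for a single SGEMM call placed in its own stream; the routine name and matrix sizes are illustrative assumptions:

     #include <cublas_v2.h>
     #include <cuda_runtime.h>

     void sgemm_example(int n, const float *d_A, const float *d_B, float *d_C)
     {
         cublasHandle_t handle;
         cublasCreate(&handle);                    /* thread-safe, per-thread handle   */

         cudaStream_t stream;
         cudaStreamCreate(&stream);
         cublasSetStream(handle, stream);          /* run the call in its own stream   */

         const float alpha = 1.0f, beta = 0.0f;    /* scalars passed by pointer in v2  */
         cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);

         cudaStreamDestroy(stream);
         cublasDestroy(handle);
     }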




CUDA OpenACC

  • OpenACC is a new open parallel programming standard designed to enable the millions of scientific and technical Fortran and C programmers to easily take advantage of the transformative power of heterogeneous CPU/GPU computing systems. OpenACC provides compiler directives that identify the areas of code to accelerate, without requiring programmers to modify or adapt the underlying code itself. The high degree of data parallelism present in the code is exposed to the compiler, and the directives allow the compiler to do the detailed work of mapping the computation onto the accelerator.

    Directives provide a common code base that is multi-platform and multi-vendor compatible, offering an ideal way to preserve investment in legacy applications by enabling an easy migration path to accelerated computing. NVIDIA is positioning its OpenACC as a kind of high level gateway to its lower level CUDA GPU programming language. Based on the OpenACC standard, GPU directives are the easy, proven way to accelerate your scientific or industrial code.

    With GPU directives, users can accelerate their code simply by inserting compiler hints into the code, and the compiler will automatically map the compute-intensive portions onto the GPU. The OpenACC Application Program Interface (OpenACC API), which provides portability across operating systems, host CPUs and GPU accelerators, and the CUDA APIs will be used in this Coding Competition. PGI Accelerator compilers with PGI directive-based programming (OpenACC directives) and NVIDIA CUDA GPUs (CUDA SDK/APIs; CUDA tuning and performance; CUDA Toolkit) are used as the computing platform in the hypack-2013 workshop.

    The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators. For free trial software, users can try NVIDIA's "Accelerate Your Scientific Code with OpenACC" trial, the Portland Group PGI Accelerator Fortran and C99 trial software, or the CAPS Enterprise HMPP Workbench trial software. For more details, visit NVIDIA OpenACC.

    PGI :

    The Portland Group (PGI) ( http://www.pgroup.com ), the leading independent supplier of compilers and tools for high-performance computing, has made available the initial release of its PGI Accelerator Fortran and C compilers with support for the new OpenACC specification for directive-based programming of GPUs and accelerators. A trial version of the beta release of the PGI Accelerator compilers with support for the OpenACC standard is available for free from the PGI website at http://www.pgroup.com/support/downloads.php. The beta software includes a restricted-use license; the license agreement is available at http://www.pgroup.com/support/BTLA. More information on the PGI Accelerator compilers with OpenACC support is available at http://www.pgroup.com/accelerate. More information on the OpenACC API and standard can be found at http://www.openacc.org.

    Execution Model : The execution model targeted by OpenACC API-enabled compilers is host-directed execution with an attached accelerator device, such as a GPU. The bulk of a user application executes on the host. Compute-intensive regions are offloaded to the accelerator device under control of the host. The device executes parallel regions, which typically contain work-sharing loops, or kernels regions, which typically contain one or more loops that are executed as kernels. Even in accelerator-targeted regions, the host must orchestrate the execution by allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the parallel region, queuing the device code, waiting for completion, transferring results back to the host, and deallocating memory.

    Memory Model : The most significant difference between a host-only program and a host+accelerator program is that the memory on the accelerator may be completely separate from host memory. This is the case with most current GPUs, for example. In this case, the host may not be able to read or write device memory directly because it is not mapped into the host's virtual memory space. All data movement between host memory and device memory must be performed by the host through runtime library calls that explicitly move data between the separate memories, typically using direct memory access (DMA) transfers. Programmers must be aware of the available memory bandwidth, which affects compute intensity, as well as the limited device memory and the cache available for read-only data.



An Overview of OpenACC Directives

  • Directives facilitate code development for accelerators
  • Provide functionality to initiate accelerator startup and shutdown; manage data and program transfers between the host (CPU) and the accelerator; manage the distribution of work between the host (CPU) and the accelerator; and map the required computations onto accelerators

Some of the Categories of OpenACC APIs are :
  • Accelerator Parallel Region / Kernels Directives
  • Loop Directives
  • Data Declaration Directives
  • Data Region Directives
  • Cache Directives
  • Runtime Library Routines
  • Environment variables

C/C++ :
     #pragma acc directive-name [clause [, clause]...] new-line

Fortran :

     !$acc directive-name [clause [, clause]...]

     c$acc directive-name [clause [, clause]...]

     *$acc directive-name [clause [, clause]...]


OpenACC Parallel Directive

#pragma acc parallel [clause [, clause]...] new-line
            structured block


The kernels directive defines a region of a program that is to be compiled into a sequence of kernels for execution on the accelerator; typically, each loop nest becomes a different kernel, and the kernels are launched in order on the device. When a parallel directive is executed, gangs of worker threads are created to execute the region on the accelerator; one worker in each gang begins executing the code in the structured block, and the number of gangs and workers remains constant for the duration of the parallel region.
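A minimal OpenACC sketch in C combining data clauses with the parallel loop directive described above; the SAXPY routine and the compile command are illustrative assumptions:

     /* saxpy.c : compile e.g. with "pgcc -acc saxpy.c" (PGI Accelerator compiler) */
     void saxpy(int n, float a, const float *restrict x, float *restrict y)
     {
         /* copyin: x is only read on the device; copy: y is read and written back */
         #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
         for (int i = 0; i < n; ++i)
             y[i] = a * x[i] + y[i];
     }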


OpenCL - CUDA Enabled NVIDIA GPU

Architecture : The CUDA architecture is close to the OpenCL architecture. A CUDA device is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). A multiprocessor corresponds to an OpenCL compute unit. A multiprocessor executes a CUDA thread for each OpenCL work-item and a thread block for each OpenCL work-group. A kernel is executed over an OpenCL NDRange by a grid of thread blocks. Each of the thread blocks that execute the kernel is therefore uniquely identified by its work-group ID, and each thread by its global ID or by a combination of its local ID and work-group ID. A thread is also given a unique thread ID within its block. When an OpenCL program on the host invokes a kernel, the work-groups are enumerated and distributed as thread blocks to the multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

Memory Model : Each multiprocessor of the NVIDIA CUDA architecture has on-chip memory of the following four types:

  • One set of local 32-bit registers per processor,

  • A parallel data cache, or shared memory, that is shared by all scalar processor cores and is where OpenCL local memory resides,

  • A read-only constant cache that is shared by all the scalar processor cores and speeds up reads from OpenCL constant memory,

  • A read-only texture cache that is shared by all scalar processor cores and speeds up reads from OpenCL image objects; each multiprocessor accesses the texture cache via a texture unit that implements the various addressing modes and data filtering specified by OpenCL sampler objects. The region of device memory addressed by images is referred to as texture memory.

There is also a global memory address space that is used for OpenCL global memory and a local memory address space that is private to each thread (and should not be confused with OpenCL local memory). Both memory spaces are read-write regions of device memory and are not cached.
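The terminology mapping above can be made concrete with a small sketch: the same global-index computation written as a CUDA kernel, with the corresponding OpenCL constructs noted in comments (the kernel name is an illustrative assumption).

     /* CUDA kernel; the OpenCL equivalents are noted in the comments */
     __global__ void add_one(float *buf, int n)
     {
         /* CUDA:   blockIdx.x * blockDim.x + threadIdx.x
            OpenCL: get_global_id(0)                            */
         int gid = blockIdx.x * blockDim.x + threadIdx.x;

         /* CUDA threadIdx.x  <->  OpenCL get_local_id(0)
            CUDA blockIdx.x   <->  OpenCL get_group_id(0)
            CUDA __shared__   <->  OpenCL __local memory         */
         if (gid < n)
             buf[gid] += 1.0f;
     }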



List of Programs - OpenCL - CUDA enabled NVIDIA GPUs :

  • CUDA Kernels, Thread Hierarchy, Device Memory, Advantage of Shared Memory, page-locked & pinned memory,Asynchronous Concurrent Execution,Overlap of Data Transfer and Kernel Execution, Dynamic partitioning of Shared memory resources, Kernel function -Divergence; Global Memory Bandwidth (Memory access pattern for coalescing).

  • CUDA Streams, Multi-Device System, Warp level Parallelism - CUDA, Data Prefetching -

  • Basic Codes : Numerical integration of the function f(x) = 4/(1+x^2) between the limits 0 and 1; prefix sum of a given array

  • Introduction to NVIDIA-PGI Compiler Directives - OpenACC on GPUs; CUDA enabled NVIDIA GPUs

  • Performance of Matrix Computations - NVIDIA-PGI Compiler Directives OpenACC on GPUs; CUDA enabled NVIDIA GPUs

  • Performance of Application Kernels - NVIDIA-PGI Compiler Directives OpenACC on GPUs; CUDA enabled NVIDIA GPUs

  • Example programs based on The OpenACC Application Program Interface (a collection of compiler directives and the details are implicit in the programming model and are managed by the OpenACC API-enabled compilers and runtimes) for matrix computations on NVIDIA GPUs.

  • Simple example programs on Multi-Core Processors with NVIDIA - GPU Computing CUDA 4.1 SDK.

  • Special example programs using CUDA Tool Chain on Multi-Core Processors with NVIDIA - GPU Computing CUDA SDK (CULA Tools, CUBLAS, CUFFT, CUSPARSE)

  • Special example programs on matrix computations using Concurrent Asynchronous Execution APIs of CUDA 4.1 enabled NVIDIA GPUs (single/Multiple devices).

  • Special example programs based on Streams (Concurrent Asynchronous Execution) of CUDA 4.1 of NVIDIA GPU

  • LLVM-based CUDA compiler and toolkit technologies for matrix computation and application kernels; GPU Accelerator Programming Model - Compiler Optimizations

  • Exposure to the NVIDIA Parallel Nsight toolkit.

  • Codes to understand different memory types of CUDA enabled NVIDIA GPUs for matrix computations.

  • Example programs based on Numerical Linear Algebra using CUDA enabled NVIDIA GPUS and OpenCL.

  • Example programs (BLAS, FFTs) based on CUDA BLAS Libraries

  • Example programs based on special classes of problems - Dense & Sparse Matrix Computations, Fast Search Algorithms, & Partial Differential Eqs. (PDEs) - discussed using CUDA enabled NVIDIA GPUs

  • Code Walk through and execution of parallel programs based on a mixed programming environment using TBB, Pthreads, OpenMP on host Multi-Core systems with GPU Accelerator devices.

  • Selective example programs on numerical and non-numerical computations using NVIDIA - GPU Computing CUDA SDK and OpenCL.

  • Example programs based on CUDA APIs to completely overlap CPU and GPU execution and I/O in HPC GPU Cluster environment.

  • Performance of memory (pinned/locked) & CUDA shared memory usage on CUDA enabled GPUs for application kernels.

  • Develop test suites to launch multiple kernels on CUDA enabled NVIDIA single & multiple GPU devices.

  • Tuning & Performance using CUDA enabled NVIDIA GPU Libraries; Memory Optimisation, Data-access optimization for matrix computations

  • Demonstration of Integrated Numerical Linear Algebra Kernels for Matrix Computations (using Open Source Software) on CUDA enabled NVIDIA GPUs & OpenCL.

  • Example programs on Heterogeneous Programming - OpenCL based on CUDA enabled NVIDIA GPUs.


  • Implementation of Image Processing applications (Edge Detection, Face Detection & Image inpainting algorithms) on GPGPUs using CUDA/OpenCL enabled NVIDIA GPUs and OpenCL of HPC GPU Cluster

  • Implementation of String Search Algorithms - CUDA/OpenCL enabled NVIDIA GPUs and OpenCL of HPC GPU Cluster

  • Tiled matrix-matrix multiplication, Numerical Linear Algebra - CUDA; CUDA BLAS Libraries, CUDA SDKs, Implementation of Partial Differential Equations, Image Processing - Edge Detection Algorithms; String Search Algorithms

  • Example programs that take advantage of shared memory features of CUDA enabled NVIDIA GPUs for Dense Matrix computations

  • Example programs that take advantage of CUDA Streams for Multi-GPU implementation of Dense matrix computation Kernels

The matrix multiplication examples illustrate the typical data-parallel approach used by OpenCL applications to achieve good performance on GPUs. They illustrate the use of OpenCL local memory, which maps to shared memory on the CUDA architecture. Shared memory is much faster than global memory, and implementations based on shared-memory accesses give a performance improvement for typical matrix computations.
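A minimal CUDA sketch of the tiled (shared-memory) matrix multiplication idea described above, for square matrices whose dimension is a multiple of the tile width; the tile width and kernel name are illustrative assumptions:

     #define TILE 16

     /* C = A * B for n x n matrices; n is assumed to be a multiple of TILE */
     __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
     {
         __shared__ float As[TILE][TILE];    /* tiles staged in fast shared memory  */
         __shared__ float Bs[TILE][TILE];

         int row = blockIdx.y * TILE + threadIdx.y;
         int col = blockIdx.x * TILE + threadIdx.x;
         float acc = 0.0f;

         for (int t = 0; t < n / TILE; ++t) {
             As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
             Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
             __syncthreads();                /* tile fully loaded before use        */

             for (int k = 0; k < TILE; ++k)
                 acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
             __syncthreads();                /* finish using the tile before reload */
         }
         C[row * n + col] = acc;
     }

The kernel would be launched with a (TILE, TILE) thread block and an (n/TILE, n/TILE) grid, so that each block computes one output tile from shared-memory staging rather than repeated global-memory reads.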

Experts may discuss performance guidelines, focusing on Instruction Performance, Memory Bandwidth Issues, Shared Memory, NDRange & execution time of a kernel launch on the OpenCL implementation, Data transfer between Host and Device, Warp level synchronization issues, and overall performance optimization strategies.



References

1. NVIDIA Kepler Architecture
2. NVIDIA CUDA toolkit 5.0 Preview Release April 2012
3. NVIDIA Developer Zone
4. RDMA for NVIDIA GPUDirect coming in CUDA 5.0 Preview Release, April 2012
5. NVIDIA CUDA C Programming Guide Version 4.2, dated 4/16/2012 (April 2012)
6. Dynamic Parallelism in CUDA Tesla K20 Kepler GPUs - Pre-release of NVIDIA CUDA 5.0
7. NVIDIA Developer ZONE - CUDA Downloads CUDA TOOLKIT 4.2
8. NVIDIA Developer ZONE - GPUDirect
9. Openacc - NVIDIA
10. Nsight, Eclipse Edition Pre-release of CUDA 5.0, April 2012
11. NVIDIA OpenCL Programming Guide for the CUDA Architecture version 4.0 Feb, 2011 (2/14,2011)
12. Optimization : NVIDIA OpenCL Best Practices Guide Version 1.0, Feb 2011
13. NVIDIA OpenCL JumpStart Guide - Technical Brief
14. NVIDIA CUDA C BEST PRACTICES GUIDE (Design Guide) V4.0, May 2011
15. NVIDIA CUDA C Programming Guide Version V4.0, May 2011 (5/6/2011)
16. NVIDIA GPU Computing SDK
17. Apple : Snowleopard - OpenCL
18. The OpenCL Specification, Version 1.1, Published by Khronos OpenCL Working Group, Aaftab Munshi (ed.), 2010.
19. The OpenCL Specification, Version 1.0, Khronos OpenCL Working Group
20. Khronos V1.0 Introduction and Overview, June 2010
21. The OpenCL 1.1 Quick Reference card.
22. OpenCL 1.1 Specification (Revision 44) June 1, 2011
23. The OpenCL 1.1 Specification (Document Revision 44) Last Revision Date : 6/1/11 Editor : Aaftab Munshi Khronos OpenCL Working Group
24. OpenCL Reference Pages
25. MATLAB
26. NVIDIA - CUDA MATLAB Acceleration
27. Jason Sanders, Edward Kandrot (Foreword by Jack Dongarra), CUDA by Example - An Introduction to General-Purpose GPU Programming, Addison-Wesley, 2011, NVIDIA
28. David B. Kirk, Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, NVIDIA Corporation, 2010; Elsevier / Morgan Kaufmann Publishers, 2011
29. OpenCL Toolbox for MATLAB
30. NAG