



hypack-2013 Mode-2 : GPGPU AMD-APP (APU) SDK

AMD Accelerated Parallel Processing (AMD APP) software harnesses the tremendous processing power of GPUs for high-performance, data-parallel computing in a wide range of applications. The AMD Accelerated Parallel Processing system includes a software stack and the AMD GPUs. Please refer to the AMD Accelerated Parallel Processing (AMD APP) OpenCL Programming Guide to understand the relationship of the AMD Accelerated Parallel Processing components.

The AMD APP software stack provides end-users and developers with a complete, flexible suite of tools to leverage the processing power in AMD GPUs. AMD APP is an OpenCL software development platform for x86-based CPUs that also provides a complete heterogeneous OpenCL development platform for both the CPU and GPU. The software includes the OpenCL compiler and runtime; the device driver for the GPU compute device, the AMD Compute Abstraction Layer (CAL); the performance profiling tools AMD APP Profiler and AMD APP KernelAnalyzer; and performance libraries such as the AMD Core Math Library (ACML).


AMD APP SDK : OpenCL     CAL     List of Programs - OpenCL

List of Programs OpenCL - AMD APP

Module 1 : Getting Started : Basics - OpenCL
Module 2 : OpenCL Programs on Matrix Computations
Module 3 : OpenCL Programs using BLAS libraries for Matrix Computations
Module 4 : OpenCL Programs - Application Kernels
Module 5 : OpenCL Memory Optimization Programs - Tuning & Performance


References & Web-Pages : GPGPU & GPU Computing       Web-sites


GPGPU-History

In recent years, general-purpose GPU (GPGPU) processing has gained much attention. The phrase "general purpose" in the context of High Performance Computing (HPC) usually means data-intensive applications in scientific and engineering fields. In GPGPU processing, the graphics performance of specialized software (e.g., scientific software, image manipulation, video decoders/encoders, games) makes GPU performance quite important.

The speed at which data can be sent to the GPGPUs, internally processed, and the results sent back is as important as the processing power of the GPGPUs. Equally important is the performance of video (GFX) rendering, i.e., how efficiently graphics processors can handle rendering. Such operations are used by all graphics software, image manipulation, video decoders/encoders, games, and modern operating systems. Video (GFX) memory is crucial for performance: the bandwidth of the video adapters' (GFX) memory and the bandwidth of the bus drive the performance.

In these programming techniques, programmers can use the GPU's pixel shaders as general-purpose single-precision FPUs. For typical video applications, GPGPU processing is highly parallel, but it relies on the size of off-chip video memory to operate on large data sets. Off-chip memory plays an important role in GPGPU applications in which different threads must interact with each other through off-chip memory. From a graphics point of view, the video memory, normally used for texture maps and so forth in graphics applications, may store any kind of data in GPGPU applications. Video (GFX) memory is crucial for performance: the bandwidth of the video adapters' (GFX) memory and the bandwidth of the bus that connects them to the computer drive the performance.


The NVIDIA CUDA and AMD APP models are highly parallel GPGPU models. The approach is to divide the data set into smaller chunks stored in on-chip memory, then allow multiple thread processors to share each chunk. Storing the data locally reduces the need to access off-chip memory, thereby improving performance.

The GPU is viewed as a compute device capable of executing a very high number of threads in parallel. It operates as a coprocessor to the main CPU, called the host. Data-parallel, compute-intensive portions of applications running on the host are offloaded to the device as a function that is executed on the device by many different threads. Both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively. One can copy data from one DRAM to the other through optimized API calls that utilize the device's high-performance Direct Memory Access (DMA) engines.


GPGPU : User's View
  • High-performance computing on GPUs has attracted interest in the academic community as well as in industry, so there is growing expertise among programmers and a range of alternative approaches to software development.

  • GPUs have become fully programmable devices; the shaders used to be hard-wired. Mature programming interfaces, tools, and support for double precision are now required, whereas single-precision floating point is sufficient for consumer graphics.

  • Developers of high-performance heterogeneous computing applications currently must choose a nonstandard approach to GPU technology, and NVIDIA CUDA is only one option among many other software programming paradigms.

  • Other multi-core development platforms (e.g., RapidMind) support multiple processor architectures, including NVIDIA's GPUs, ATI's GPUs, IBM's Cell BE, and Intel's and AMD's x86. This flexibility lets developers target the architecture currently delivering the best performance - without rewriting or even recompiling the source code.

  • Developers will think seriously before rewriting code for a particular development platform. They prefer an architecture-independent, industry-standard solution widely supported by tool vendors.

  • Many companies (AMD, IBM, Intel, Microsoft, and others) are working toward standard parallel-processing extensions to C/C++, though their efforts may take more time.

GPGPU Programming Environment


Graphics processing units (GPUs, from AMD and NVIDIA) and the Cell Broadband Engine (Cell BE) processor by Sony, Toshiba, and IBM have demonstrated tremendous performance improvements employing scalable parallel-processor architectures. In the past, the RapidMind Development Platform ( http://www.rapidmind.net ) emerged for programming on multi-core processors. Without changing the application logic, if multi-core processors with accelerators are available, an additional speedup can be achieved using RapidMind. The RapidMind platform automatically manages the movement of data and computation between the accelerator and the host.

RapidMind Programming Environment : RapidMind is a development and runtime platform that enables single-threaded, manageable applications to fully access multi-core processors. The RapidMind Development Platform is a framework for expressing data-parallel computations from within C++ and executing them efficiently on multicore processors. The RapidMind Multi-core Development Platform provides a simple single-source mechanism to develop portable high-performance applications for multicore processors. The computation on multiple cores within existing C++ applications can be carried out without many changes. The RapidMind Platform provides a set of backends; each provides services that support the execution of RapidMind programs on a particular processor. The developer does not have to deal with the details of each processor, and is free to write portable applications that work on a variety of processor targets.

  • The x86 backend executes RapidMind programs on x86 CPUs from Intel and AMD
  • The GPU backend executes RapidMind programs on a variety of Graphics Processing Units (GPUs) from both AMD-ATI and NVIDIA
  • The Cell BE backend executes RapidMind programs on the SPEs of the Cell Broadband Engine
  • The Debug backend executes RapidMind programs on the host processor, compiling programs with a C compiler

The RapidMind implementations of the Fast Fourier Transform (FFT), the Basic Linear Algebra Subroutines (BLAS), and single-precision matrix multiply (SGEMM) on GPU and CPU cores showed good performance in comparison with the same algorithms running on a CPU core. In particular, RapidMind can be used to develop applications that fully exploit the power of the Cell Broadband Engine (Cell/B.E.) processor's unique architecture by writing only one single-threaded C++ program using an existing C++ compiler. Applications such as real-time ray tracing for the automotive industry and real-time medical-imaging reconstruction were demonstrated with strategic partner AMD at the 2008 SIGGRAPH Conference & Exhibition.

GPU - Stanford Brook :


Brook for GPUs is a compiler and runtime implementation of the Brook stream programming language for modern graphics hardware. Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data-parallel computing and arithmetic intensity into a familiar and efficient language. Brook started as an open-source project from Stanford University for demonstrating general-purpose data-parallel computations on graphics processors.

  • Data Parallelism: Allows the programmer to specify how to perform the same operations in parallel on different data.
  • Arithmetic Intensity: Encourages programmers to specify operations on data, which minimize global communication and maximize localized computation.

In data parallelism, each fragment of data is processed independently, which leads to better ALU utilization and helps hide memory latency. In comparison with a CPU, a GPU has a large number of hardware threads, whereas a CPU has only one or two streams of execution. Brook was a set of extensions to the C language - "C with streams" - which exposes the graphics processing unit to general-purpose and parallel computing. Brook can be compiled for Windows, Linux, and Mac OS X, with DirectX 9+ and OpenGL back ends.

The Brook project aimed to demonstrate general-purpose programming on GPUs and to drive research on the stream-language programming model and streaming applications. Brook makes programming GPUs easier and hides complexities such as data management, graphics-based constructs in Cg/HLSL, and the rendering process. It virtualizes the GPU's resources and exposes them as if they were an extension of the CPU.
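A SAXPY computation in Brook's "C with streams" style looks roughly like the following (a sketch after the published Brook examples; the stream sizes and names are illustrative):

```
kernel void saxpy(float a, float4 x<>, float4 y<>, out float4 result<>) {
    result = a * x + y;      // applied to every stream element
}

float a;
float4 X[100], Y[100], Result[100];
float4 x<100>, y<100>, result<100>;   // streams of 100 elements

streamRead(x, X);          // copy input data into the streams
streamRead(y, Y);
saxpy(a, x, y, result);    // run the kernel over the whole stream
streamWrite(result, Result);
```

The `<>` brackets declare streams; the compiler maps the kernel over all elements, so the programmer expresses the operation once and the data parallelism is implicit.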

ATI & AMD GPGPU - Stream Computing

ATI, which was acquired by AMD in 2006, developed some of the best graphics processing technology. ATI graphics processing technology provides the server market with cost-efficient and reliable products. Graphics stability, video quality, bus architectures, and software support all play an important role in a winning combination for the server market. Providing outstanding stability in display environments and maximum flexibility across multiple applications plays a crucial role in graphics computing.

The development and evolution of parallel rendering middleware is necessary for large-scale real-time applications with visualization features. Application-transparent toolkits (Chromium and SGI OpenGL Multipipe) as well as application programming interfaces (APIs) like SGI Multipipe SDK and OpenProducer play an important role from a performance point of view. Expertise in stream-processing middleware such as AMD's Close-to-the-Metal, NVIDIA's CUDA, and Stanford University's Brook is necessary to scale application performance and provide seamless integration with the work-flow of complex applications, coupled with algorithm acceleration. An extensive understanding of graphics hardware and system architectures allows users to leverage the computational power of graphics processors (GPUs) for high-performance computing applications.

Stream computing harnesses the tremendous processing power of the graphics processing unit (GPU) for high-performance, data-intensive computations over a wide range of scientific, business, and consumer applications. In stream computing, the GPU's SIMD architecture and its many cores are available: operations are applied in parallel through the SIMD architecture to a given data set, or stream of data. Stream computing offers a number of benefits to a certain class of applications that are highly parallelizable.

Advanced Micro Devices, Inc. (AMD) Stream Computing is a first step in harnessing the tremendous processing power of the GPU (stream processor) for high-performance, data-parallel computing in a wide range of business, scientific, and consumer applications. AMD's Stream Computing software stack provides a flexible suite of tools to leverage the processing power of AMD Stream Processors (AMD FireStream™ GPUs) for end-users and developers. To take advantage of the GPU's SIMD architecture and the hundreds of parallel compute cores it provides, AMD has developed a full software stack of development tools for both 32-bit and 64-bit Linux and Windows operating systems: the AMD Stream SDK. AMD is also porting many common math library functions from the AMD ACML package to the GPU to support compute-intensive applications.


AMD Stream Accelerators - Software Stack & features

AMD's Stream Computing software stack includes the following components:

  • Performance Libraries: AMD Core Math Library (ACML) and COBRA for optimized domain-specific Algorithms
  • Compilers: Brook+ and RapidMind

    Brook+ is AMD's implementation of the open-source Brook C/C++ compiler, which AMD is enhancing to include new features and a back end that targets FireStream™ GPU processors.

    RapidMind is a complete development environment - C++ compilers and IDEs - to improve the programmability, performance, and portability of third-party applications developed for AMD Stream Computing.

  • Lower Level Driver and Programming Language: AMD Compute Abstraction Layer (CAL)

    CAL provides access to the various parts of the GPU as needed. Developers are thus able to write directly to the GPU without needing to learn graphics specific programming languages. CAL provides direct communication to the device. Intermediate language specification provides low-level access to code, increasing the ability to fine-tune device performance.

  • Performance Profiling Tools: GPU ShaderAnalyzer, AMD CodeAnalyst

    GPU ShaderAnalyzer performs throughput and flow-control analyses on stream processors, generating GUI-based performance data or command-line reports. GPU ShaderAnalyzer (GSA) is a performance profiling tool useful for developing, debugging, and profiling GPU kernels written in high-level GPU programming languages. AMD CodeAnalyst is a software performance-analysis tool, which includes system-wide profiling, as well as timer-based and event-based profiling, and sampling and simulation functionality.


Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up their computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.

Important characteristics and features of AMD Stream Computing : AMD's latest generation of stream-computing GPUs supports double-precision floating point in hardware (on the AMD FireStream 9170), a critical feature for most high-performance computing applications. Peak power consumption is 100 watts - extremely low power requirements, resulting in over 5 GFLOPs per watt. From a performance point of view, asynchronous DMA transfer is supported, so data can be moved without interrupting the stream processor or CPU. Programming in a C-like environment with a high-level compiler is supported; it is built on the popular open-source Brook with enhancements from, and maintained by, AMD (Brook+). It supports 32/64-bit Linux and 32/64-bit Windows - a wide range of useful operating systems to ease deployment in various HPC environments. The ATI Shader Analyzer, a popular, freely available code-tuning application, further optimizes code.


AMD Accelerated Parallel Processing (AMD APP) Software : OpenCL


The AMD APP software stack provides end-users and developers with a complete, flexible suite of tools to leverage the processing power in AMD GPUs. AMD Accelerated Parallel Processing software embraces open-systems, open-platform standards. The AMD APP SDK open-platform strategy enables AMD technology partners to develop and provide third-party development tools. AMD APP is an OpenCL software development platform for x86-based CPUs that also provides a complete heterogeneous OpenCL development platform for both the CPU and GPU. The software includes the following components:

  • OpenCL compiler and runtime
  • Device Driver for GPU compute device - AMD Compute Abstraction Layer (CAL)
  • Performance Profiling Tools - AMD APP Profiler and AMD APP KernelAnalyzer
  • Performance Libraries - AMD Core Math Library (ACML) for optimized NDRange-specific algorithms

The latest generation of AMD GPUs is programmed using the unified shader programming model. Programmable GPU compute devices execute various user-developed programs, called stream kernels (or simply: kernels). These GPU compute devices can execute non-graphics functions using a data-parallel programming model that maps executions onto compute units. In this programming model, known as AMD Accelerated Parallel Processing, arrays of input data elements stored in memory are accessed by a number of compute units. Each instance of a kernel running on a compute unit is called a work-item. A specified rectangular region of the output buffer to which work-items are mapped is known as the n-dimensional index space, called an NDRange.

The GPU schedules the range of work-items onto a group of stream cores, until all work-items have been processed. Subsequent kernels can then be executed, until the application completes. In the AMD Accelerated Parallel Processing programming model, the mapping of work-items to stream cores is performed as follows. OpenCL maps the total number of work-items to be launched onto an n-dimensional grid (NDRange). The developer can specify how to divide these items into work-groups. AMD GPUs execute on wavefronts (groups of work-items executed in lock-step in a compute unit); there are an integer number of wavefronts in each work-group. Thus, as shown in Figure 1.5, the hardware that schedules work-items for execution in the AMD Accelerated Parallel Processing environment includes the intermediate step of specifying wavefronts within a work-group. This permits achieving maximum performance from AMD GPUs.


Work-Item Processing : All stream cores within a compute unit execute the same instruction in each cycle. A work-item can issue one VLIW instruction per clock cycle. The block of work-items that are executed together is called a wavefront. To hide latencies due to memory accesses and processing-element operations, up to four work-items from the same wavefront are pipelined on the same stream core.

Work-Item Creation : For each work-group, the GPU compute device spawns the required number of wavefronts on a single compute unit. If there are non-active work-items within a wavefront, the stream cores that would have been mapped to those work-items are idle.

Memory Architecture and Access : OpenCL has four memory domains: private, local, global, and constant; the AMD Accelerated Parallel Processing system also recognizes host (CPU) and PCI Express (PCIe) memory.

  • private memory - specific to a work-item; it is not visible to other work-items
  • local memory - specific to a work-group; accessible only by work-items belonging to that work-group
  • global memory - accessible to all work-items executing in a context, as well as to the host (read, write, and map commands).
  • constant memory - read-only region for host-allocated and -initialized objects that are not changed during kernel execution
  • host (CPU) memory - host-accessible region for an application's data structures and program data
  • PCIe memory - part of host (CPU) memory accessible from, and modifiable by, the host program and the GPU compute device. Modifying this memory requires synchronization between the GPU compute device and the CPU.

AMD APP SDK - Compute Abstraction Layer (CAL) Overview

The AMD Compute Abstraction Layer (CAL) is a device driver library that provides a forward-compatible interface to AMD GPU compute devices (see Figure 1.6). CAL lets software developers interact with the GPU compute devices at the lowest-level for optimized performance, while maintaining forward compatibility. CAL provides the following features:

  • Device management
  • Resource management
  • Code generation
  • Kernel loading and execution
  • Multi-device support
  • Interoperability with 3D graphics API

CAL provides a device driver library that allows applications to interact with the stream cores at the lowest level for optimized performance, while maintaining forward compatibility.

The CAL API is ideal for performance-sensitive developers because it minimizes software overhead and provides full control over hardware-specific features that might not be available with higher-level tools. CAL includes a set of C routines and data types that allow higher-level software tools to control hardware memory buffers (device-level streams) and GPU compute device programs (device-level kernels). The CAL runtime accepts kernels written in AMD Intermediate Language (IL) and generates optimized code for the target architecture. It also provides access to device-specific features. A CAL system comprises one or more stream processors connected to one or more CPUs by a high-speed bus. The CPU runs the CAL runtime and controls the stream processor by sending commands through the CAL API. The stream processor runs the kernel specified by the application, while the stream processor's device driver (CAL) runs on the host CPU.

List of Programs - OpenCL - AMD APP SDK

The OpenCL programming model is based on the notion of a host, supported by an application API, and a number of devices connected through a bus; these are programmed using the OpenCL C language. Most OpenCL programs follow the same pattern: given a specific platform, select a device or devices, create a context, allocate memory, create device-specific command queues, and perform data transfers and computations. The compiler tool-chain provides a common framework for both CPUs and GPUs, sharing the front end and some high-level compiler transformations. The back ends are optimized for the device type (CPU or GPU).
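The pattern can be outlined as the canonical host-side call sequence (an outline only; arguments and error checking are omitted):

```
clGetPlatformIDs(...)              // 1. discover platforms
clGetDeviceIDs(...)                // 2. pick a CPU or GPU device
clCreateContext(...)               // 3. create a context for it
clCreateCommandQueue(...)          // 4. a command queue per device
clCreateBuffer(...)                // 5. allocate device memory
clCreateProgramWithSource(...)     // 6. load OpenCL C source
clBuildProgram(...)                //    compile for this device
clCreateKernel(...)                // 7. get a kernel handle
clSetKernelArg(...)                // 8. bind buffers to arguments
clEnqueueWriteBuffer(...)          // 9. host -> device transfer
clEnqueueNDRangeKernel(...)        // 10. launch over the NDRange
clEnqueueReadBuffer(...)           // 11. device -> host results
clFinish(...)                      // 12. wait for completion
```

Every example program in the modules below is a variation on this skeleton; only the kernel source and the buffer setup change.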

  • Example programs based on simple buffer write, the SAXPY operation, a parallel min() function, and prefix operations.

  • Example programs based on Numerical Linear Algebra using OpenCL optimized features

  • Example programs based on Numerical Linear Algebra using BLAS libraries on the host CPU and device GPU, focusing on performance in terms of GFLOPS

  • Open-source software based on Numerical Linear Algebra, demonstrating performance

  • Implementation of Matrix Computations (Iterative Methods to solve Ax=b Matrix System of linear equations) on Multi-GPUs

  • OpenCL - Sparse matrix-vector multiplication kernel

  • AMD-APP (using CAL) OpenCL : Write a simple HelloCAL application program using CAL of AMD Accelerated Parallel Processing Technology

  • AMD-APP : Write Direct Memory Access (DMA) code to move data between system memory and GPU local memory using CALMemCopy of AMD APP

  • AMD-APP : Write code that uses AMD-APP asynchronous operations via the CAL API, for an application that must perform CPU computations in the application thread while also running another kernel on the AMD-APP stream processor

  • AMD-APP : Multiple devices - Write a matrix-vector multiplication based on a self-scheduling algorithm using the AMD-APP CAL API on multiple stream processors (using calDeviceGetCount), ensuring that application-created threads on the CPU manage the communication with the individual stream processors.

  • AMD-APP : Write a matrix-matrix computation on a Hybrid Computing (HC) platform in which the matrix-matrix multiply is performed judiciously on the host CPU using the ACML math library of the AMD Opteron processor and CAL/OpenCL on the AMD-ATI GPU (APP).

  • AMD-APP : Obtain the maximum achievable matrix-matrix computation performance on a Hybrid Computing (HC) platform in which the block matrix-matrix multiply is performed judiciously using the ACML math library of the AMD Opteron processor on the host CPU and a tiled (blocked) matrix-matrix multiplication based on CAL/OpenCL on the AMD-ATI GPU (APP).

  • OpenCL Implementation of the solution of Partial Differential Equations by the finite difference method

  • OpenCL Implementation of Image Processing - Edge Detection Algorithms

  • OpenCL Implementation of String Search Algorithms

References

1. AMD Fusion
2. APU
3. All about AMD FUSION APUs (APU 101)
4. AMD A6 3500 APU Llano
5. AMD A6 3500 APU review
6. AMD APP SDK with OpenCL 1.2 Support
7. AMD-APP-SDKv2.7 (Linux) with OpenCL 1.2 Support
8. AMD Accelerated Parallel Processing Math Libraries (APPML)
9. AMD Accelerated Parallel Processing (AMD APP) Programming Guide OpenCL : May 2012
10. MAGMA OpenCL
11. AMD Accelerated Parallel Processing (APP) SDK (formerly ATI Stream) with AMD APP Math Libraries (APPML); AMD Core Math Library (ACML); AMD Core Math Library for Graphic Processors (ACML-GPU)
12. Getting Started with OpenCL
13. Aparapi - API & Java
14. AMD Developer Central - OpenCL Zone
15. AMD Developer Central - SDKs
16. ATI GPU Services (AGS) Library
17. AMD GPU - Global Memory for Accelerators (GMAC)
18. AMD Developer Central - Programming in OpenCL
19. AMD GPU Task Manager (TM)
20. AMD APP Documentation
21. AMD Developer OpenCL FORUM
22. AMD Developer Central - Programming in OpenCL - Benchmarks performance
23. OpenCL 1.2 (pdf file)
24. OpenCL™ Optimization Case Study: Fast Fourier Transform - Part 1
25. AMD GPU PerfStudio 2
26. Open Source Zone - AMD CodeAnalyst Performance Analyzer for Linux
27. AMD ATI Stream Computing OpenCL - Programming Guide
28. AMD OpenCL Emulator-Debugger
29. GPGPU : http://www.gpgpu.org and Stanford BrookGPU discussion forum http://www.gpgpu.org/forums/
30. Apple : Snowleopard - OpenCL
31. The OpenCL Specification, Version 1.0, Khronos OpenCL Working Group
32. Khronos V1.0 Introduction and Overview, June 2010
33. The OpenCL 1.1 Quick Reference card.
34. OpenCL 1.2 Specification (Document Revision 15), Last Released November 15, 2011
35. The OpenCL 1.2 Specification (Document Revision 15) Last Released November 15, 2011 Editor : Aaftab Munshi Khronos OpenCL Working Group
36. OpenCL1.1 Reference Pages
37. MATLAB
38. OpenCL Toolbox v0.17 for MATLAB
39. NAG
40. AMD Compute Abstraction Layer (CAL) Intermediate Language (IL) Reference Manual. Published by AMD.
41. C++ AMP (C++ Accelerated Massive Parallelism)
42. C++ AMP for the OpenCL Programmer
43. C++ AMP for the OpenCL Programmer
44. MAGMA SC 2011 Handout
45. AMD Accelerated Parallel Processing Math Libraries (APPML) MAGMA
46. The OpenCL 1.2 Specification Khronos OpenCL Working Group
47. The OpenCL 1.2 Quick-reference-card ; Khronos OpenCL Working Group
48. Benedict R. Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, Dana Schaa, Heterogeneous Computing with OpenCL, Elsevier, Morgan Kaufmann Publishers, 2011
49. Programming Massively Parallel Processors - A Hands-on Approach, David B. Kirk, Wen-mei W. Hwu, NVIDIA Corporation, 2010, Elsevier, Morgan Kaufmann Publishers, 2011
50. OpenCL Programming Guide, Aaftab Munshi, Benedict R. Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg, Addison-Wesley, Pearson Education, 2012
51. AMD gDEBugger
52. The HSA (Heterogeneous System Architecture) Foundation
Centre for Development of Advanced Computing