C-DAC,Pune : High-Perf. Comp. Frontier Technologies Exploration Group and CMSD, University of Hyderabad, Technology Workshop hyPACK (October 15-18), 2013

Programming on GPUs: GPGPUs / GPU Computing: References and Web Sites


[GPUComp-01].	Randi J. Rost, OpenGL Shading Language, Second Edition, Addison-Wesley, 2006

[GPUComp-02].	GPGPU Reference http://www.gpgpu.org

[GPUComp-03].	NVIDIA http://www.nvidia.com

[GPUComp-04].	NVIDIA Tesla http://www.nvidia.com/object/tesla_computing_solutions.html

[GPUComp-05].	CUDA sample source code: http://www.nvidia.com/object/cuda_get_samples.html

[GPUComp-06].	AMD Stream Processors http://ati.amd.com/products/streamprocessor/specs.html

[GPUComp-07].	OpenCL - The open standard for parallel programming of heterogeneous systems http://www.khronos.org/opencl

[GPUComp-08].	List of NVIDIA GPUs compatible with CUDA: http://www.nvidia.com/object/cuda_learn_products.html

[GPUComp-09].	RAPIDMIND http://www.rapidmind.net

[GPUComp-10].	PeakStream - Parallel Processing (acquired by Google in 2007) http://www.google.com

[GPUComp-11].	guru3d.com http://www.guru3d.com/news/sandra-2009-gets-gpgpu-support/

[GPUComp-12].	NVIDIA, NVIDIA CUDA, Programming Guide, v. 2.3, NVIDIA Corporation (2009).
[GPUComp-02].	CUDA Zone - http://www.nvidia.com/object/cuda_home.html
[GPUComp-13].	J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, T. J. Purcell, A Survey of General-Purpose Computation on Graphics Hardware, Computer Graphics Forum (2007), Vol. 26, pages 80-113.
[GPUComp-14].	M. Harris, Optimizing NVIDIA CUDA, Presentation at AstroGPU conference (2007).
[GPUComp-15].	G. Ruetsch, P. Micikevicius, Optimizing Matrix Transpose in CUDA, Tech report, NVIDIA Corporation (2009).
[GPUComp-16].	GPU Gems book series (available online), GPU Gems: http://developer.nvidia.com/object/gpu_gems_home.html http://developer.nvidia.com/object/gpu_gems_2_home.html http://developer.nvidia.com/object/gpu-gems-3.html
[GPUComp-18].	M. Harris, Parallel Prefix Sum (Scan) with CUDA, Tech report, NVIDIA Corporation (2008).
[GPUComp-19].	N. Satish, M. Harris, M. Garland, Designing Efficient Sorting Algorithms for Many-core GPUs, Tech report, NVIDIA Corporation (2008).
[GPUComp-20].	J. Meng, K. Skadron, Performance Modeling and Automatic Ghost Zone Optimization for Iterative Stencil Loops on GPUs, ICS '09: Proceedings of the 23rd International Conference on Supercomputing (2009), 256-265.
[GPUComp-21].	M. Harris, GPU Gems: Chapter 38 - Fast Fluid Dynamics Simulation on the GPU, GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics - NVIDIA Corporation (2007).
[GPUComp-22].	P. Micikevicius, 3D Finite Difference Computation on GPUs using CUDA, Tech report, NVIDIA Corporation (2009).
[GPUComp-23].	M. Bader, H-J. Bungartz, D. Mudigere, S. Narasimhan, B. Narayanan, Optimized CUDA Implementation of a Navier-Stokes based flow solver for the 2D Lid Driven Cavity, poster at the NVIDIA GPU research summit (2009).
[GPUComp-24].	J. M. Cohen, M. J. Molemaker, A Fast Double Precision CFD Code using CUDA, Tech report, NVIDIA Corporation (2009).
[GPUComp-26].	RAPIDMIND & AMD http://www.rapidmind.net/News-Aug4-08-SIGGRAPH.php
[GPUComp-27].	Merrimac - Stream Architecture; Stanford Brook for GPUs http://www-graphics.stanford.edu/projects/brookgpu/
[GPUComp-28].	Stanford: Merrimac - Stream Architecture http://merrimac.stanford.edu/
[GPUComp-29].	ATI RADEON - AMD http://www.canadacomputers.com/amd/radeon/
[GPUComp-30].	Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid, by Jeff Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder, Caltech Report (2003); supported in part by NSF, NVIDIA
[GPUComp-31].	Scan Primitives for GPU Computing, by Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens, University of California, Davis & NVIDIA Corporation, Graphics Hardware (2007).
[GPUComp-34].	Bolz J., Farmer I., Grinspun E., Schröder P.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid, ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2003) 22, 3 (July 2003), pp. 917-924.
[GPUComp-35].	Number Crunching with GPUs: PeakStream Math API Exploits Parallelism in Graphics Processors, October 2006; Microprocessor Report http://www.mdronline.com
[GPUComp-36].	Tom R. Halfhill, Parallel Processing with CUDA: Nvidia's High-Performance Computing Platform Uses Massive Multithreading; Microprocessor Report, Volume 22, Archive 1, January 2008 http://www.mdronline.com
[GPUComp-37].	I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, Brook for GPUs: Stream Computing on Graphics Hardware, ACM Trans. Graph. (SIGGRAPH) 2004
[GPUComp-38].	J. Krüger, R. Westermann, Linear Algebra Operators for GPU Implementation of Numerical Algorithms, ACM Trans. Graph. (SIGGRAPH) 22 (3), pp. 908-916 (2003)
[GPUComp-39].	Tutorial SC 2007 : High Performance Computing with CUDA
[GPUComp-40].	FASTRA http://www.fastra.ua.ac.be/en/faq.html
[GPUComp-41].	AMD Stream Computing software Stack http://www.amd.com
[GPUComp-42].	BrookGPU: http://graphics.stanford.edu/projects/brookgpu/index.html
[GPUComp-43].	Tom R. Halfhill, Intel's Larrabee Redefines GPUs - Fully Programmable Many-Core Processor Reaches Beyond Graphics, Microprocessor Report, September 29, 2008
[GPUComp-44].	Tom R. Halfhill, AMD's Stream Becomes a River - Parallel Processing Platform for ATI GPUs Reaches More Systems, Microprocessor Report, December 2008
[GPUComp-45].	General-purpose computing on graphics processing units (GPGPU) http://en.wikipedia.org/wiki/GPGPU
[GPUComp-46].	Khronos Group, OpenGL 3, December 2008 http://www.khronos.org/opengl
[GPUComp-47].	Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, EXOCHI: Architecture and Programming Environment for a Heterogeneous Multi-core Multithreaded System, PLDI '07
[GPUComp-48].	Daniel Weiskopf, Basics of GPU-Based Programming, Institute of Visualization and Interactive Systems, Interactive Visualization of Volumetric Data on Consumer PC Hardware: Basics of Hardware-Based Programming University of Stuttgart, VIS 2003
[GPUComp-48].	GPU Programming Languages http://www.cis.upenn.edu/~suvenkat/700/
[GPUComp-49].	OpenGL design http://graphics.stanford.edu/courses/cs448a-01-fall/design_opengl.pdf
[GPUComp-51].	Mary Fetcher and Vivek Sarkar, Introduction to GPGPUs - Seminar on Heterogeneous Processors, Dept. of Computer Science, Rice University, October 2007
[GPUComp-52].	C-DAC Technology Workshops PEEP-2008 & OPECG-2009 http://www.cdac.in
[GPUComp-53].	NVIDIA CUDA Quick Start Guide 2007-2009 http://www.nvidia.com/object/cuda_develop.html
[GPUComp-54].	NVIDIA OpenCL Best Practices Guide Version 1.0 August 2009 http://www.nvidia.com
[GPUComp-55].	NVIDIA OpenCL Getting Started Guide Version 2009 http://www.nvidia.com
[GPUComp-56].	NVIDIA OpenCL Programming Guide for the CUDA Architecture Version 2.3 August 2009 http://www.nvidia.com
[GPUComp-57].	NVIDIA OpenCL JumpStart Guide Technical Brief Version 0.9 April 2009 http://www.nvidia.com
[GPUComp-57].	The OpenCL Specification, Version 1.0, published by the Khronos OpenCL Working Group, ed.: Aftab Munshi, 2009 http://www.khronos.org/registry/cl
[GPUComp-58].	Programming Guide AMD - ATI Stream Computing - Compute Abstraction Layer (CAL) March 2010 http://www.amd.com
[GPUComp-59].	AMD - ATI Stream http://www.amd.com/stream
[GPUComp-60].	Programming Guide - AMD - ATI Stream Computing - OpenCL March 2010 http://www.amd.com/stream
[GPUComp-61].	AMD - ATI Stream Developer Forum http://www.amd.com/streamdevforum
[GPUComp-62].	OpenGL Programming Guide http://www.glprogramming.com/red/
[GPUComp-63].	GPGPU http://www.gpgpu.org; Stanford discussion forum http://www.gpgpu.org/forums/
[GPUComp-64].	Technical Notes - ATI Stream SDK v2.01 Performance and Optimization http://www.amd.com/stream
[GPUComp-65].	Microsoft DirectX Reference Web site http://www.msdn.microsoft.com/en-us/directx
[GPUComp-66].	I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: stream computing on graphics hardware," ACM Trans. Graph., vol. 23, no. 3, pp. 777-786, 2004
[GPUComp-67].	Buck, Ian; Foley, Tim; Horn, Daniel; Sugerman, Jeremy; Hanrahan, Pat; Houston, Mike; Fatahalian, Kayvon. "BrookGPU" http://graphics.stanford.edu/projects/brookgpu/
[GPUComp-68].	ATI Compute Abstraction Layer (CAL) Intermediate Language (IL) Reference Manual. Published by AMD.
[GPUComp-69].	CAL Image. ATI Compute Abstraction Layer Program Binary Format Specification. Published by AMD.
[GPUComp-70].	Kernighan Brian W., and Ritchie, Dennis M., The C Programming Language, Prentice-Hall, Inc., Upper Saddle River, NJ, 1978.
[GPUComp-71].	Computational Methods for Tomography - Medical Image Processing http://www.fastra.ua.ac.be
[GPUComp-72].	GPU Gems 3: Chapter 37, Efficient Random Number Generation and Application Using CUDA, Lee Howes, David Thomas, Imperial College London
[GPUComp-73].	NVIDIA's Fermi: The First Complete GPU Computing Architecture, a white paper by Peter N. Glaskowsky (prepared under contract with NVIDIA Corporation), September 2009
[GPUComp-74].	White Paper: Looking Beyond Graphics - NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Computing; Analyst: Tom R. Halfhill, September 2009, sponsored by NVIDIA http://www.in.star-com
[GPUComp-76].	Director, Parallel Computing Research Laboratory (Par Lab), U.C. Berkeley, The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges, September 30, 2009 (NVIDIA is one of eight sponsors of the Par Lab).
[GPUComp-77].	The Portland Group - CUDA Fortran Programming Guide and Reference, published November 2009 http://www.pgroup.com/resources/accel.htm
[GPUComp-78].	The Portland Group - PGI Accelerator Compilers - CUDA-enabled NVIDIA GPUs http://www.pgroup.com/resources/accel.htm
[GPUComp-79].	GPU Computing Solutions - NVIDIA Tesla & CUDA http://www.nvidia.com/tesla http://www.nvidia.com/cuda
[GPUComp-80].	NVIDIA CUDA: Practical Uses - BeHardware, Damien Triolet, Aug 2007 http://www.behardware.com/art/lire/678/
[GPUComp-81].	Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi, and Wen-mei W. Hwu, CUDA-lite: Reducing GPU Programming Complexity, Center for Reliable and High-Performance Computing, Dept. of Electrical & Comp. Engg., Univ. of Illinois at Urbana-Champaign
[GPUComp-82].	Yao Zhang, Jonathan Cohen, John D. Owens, Fast Tridiagonal Solvers on the GPU, University of California, Davis; NVIDIA
[GPUComp-83]	Bharatkumar Sharma, Rahul Thota, Naga Vydyanathan, and Amit Kale, Towards a Robust, Real-time Face Processing System using CUDA-enabled GPUs, Siemens Corporate Technology, Bangalore, India
[GPUComp-84]	Kishore Kothapalli, Rishabh Mukherjee, M. Suhail Rehman, Suryakant Patidar, P. J. Narayanan, Kannan Srinathan, A Performance Prediction Model for the CUDA GPGPU Platform, International Institute of Information Technology, Hyderabad, India
[GPUComp-85]	John Nickolls, Ian Buck, and Michael Garland (NVIDIA), Kevin Skadron, Scalable Parallel Programming
[GPUComp-86]	N. P. Karunadasa & D. N. Ranasinghe, On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters, University of Colombo School of Computing, Sri Lanka
[GPUComp-87]	Michael Bader, Hans-Joachim Bungartz, Dheevatsa Mudigere, Srihari Narasimhan, Babu Narayanan, Fast GPGPU Data Rearrangement Kernels using CUDA, Technische Universität München, Munich, Germany; GE Global Research, JFWTC, Bangalore, India
[GPUComp-88]	M. Sussman, W. Crutchfield, and M. Papakipos, Pseudorandom Number Generation on the GPU, PeakStream, Inc., Redwood City, CA, USA
[GPUComp-89]	W. B. Langdon, A Fast High Quality Pseudo Random Number Generator for nVidia CUDA, Department of Computer Science, CREST Centre, King's College, London, WC2R 2LS, UK
[GPUComp-90]	Sara S. Baghsorkhi, Matthieu Delahaye, Sanjay J. Patel, William D. Gropp, Wen-mei W. Hwu, An Adaptive Performance Modeling Tool for GPU Architectures, University of Illinois at Urbana-Champaign, Urbana, IL 61801
[GPUComp-91]	David B. Kirk, Wen-mei W. Hwu, Programming Massively Parallel Processors - A Hands-on Approach, Morgan Kaufmann Publishers, 2010
[GPUComp-92]	Dheevatsa Mudigere, Data access optimized applications on the GPU using NVIDIA CUDA, Thesis - Master of Science in Computational Science and Engineering, Technische Universität München, Germany, October 2009
[GPUComp-93]	Dheevatsa Mudigere (Technische Universität München (TUM), Munich, Germany), Fast GPGPU Data Rearrangement Kernels using CUDA, Student Research Symposium, International Conference HiPC-2009, Kochi (Kerala, India), December 2009
[GPUComp-94]	Khronos Group (2009). The OpenCL Specification Version 1.0. Beaverton, OR: Khronos Group http://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf
[GPUComp-95]	Message Passing Interface Forum. (2009). MPI: A Message-Passing Interface Standard, Version 2.2. Knoxville: University of Tennessee. http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf
[GPUComp-96]	OpenMP Architecture Review Board. (2005). OpenMP Application Program Interface. http://www.openmp.org/mp-documents/spec25.pdf
[GPUComp-97]	Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., et al. (2004). Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics, 23(3), 777-786 http://doi.acm.org/10.1145/1186562.1015800
[GPUComp-98]	Fernando, R. (Ed.), GPU gems: Programming techniques, tips, and tricks for real-time graphics. Reading, MA: Addison-Wesley http://developer.nvidia.com/object/GPU_Gems_Home.html
[GPUComp-99]	Nickolls, J., Buck, I., Garland M., & Skadron, K. (2008). Scalable parallel programming with CUDA. ACM Queue, 6(2), 40-53.
[GPUComp-100]	NVIDIA. (2007b). NVIDIA Compute PTX: Parallel thread execution, ISA Version 1.1 http://nvidia.com/object/io_1195170102263.html
[GPUComp-101]	NVIDIA. (2009). CUDA Zone http://www.nvidia.com/CUDA
[GPUComp-102]	Segal, M., & Akeley, K. (2006). The OpenGL® graphics system: A specification, Version 2.1. Mountain View, CA: Silicon Graphics http://www.opengl.org/documentation/specs/
[GPUComp-103]	Sengupta, S., Harris M., Zhang, Y., & Owens, J. D. (2007). Scan primitives for GPU computing. In T. Aila & M. Segal (Eds.), Graphics hardware (pp. 97-106). San Diego, CA: ACM Press.
[GPUComp-104]	Stratton, J. A., Stone, S., & Hwu, W. W. (2008). MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing (LCPC). Edmonton, Canada.
[GPUComp-105]	Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., & Stone, S. S. (2008). Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 73-82). Salt Lake City, UT.
[GPUComp-106]	Ryoo, S., Rodrigues, C. I., Stone, S. S., Baghsorkhi, S. S., Ueng, S. Z., Stratton, J. A. et al. (2008). Program pruning for a multithreaded GPU. In Code generation and optimization: Proceedings of the Sixth Annual IEEE/ACM International Symposium on Code Generation and Optimization (pp. 195-204). Boston, MA.
[GPUComp-107]	Khronos Group (2010). OpenCL implementations, tutorials, and sample code. Beaverton, OR: Khronos Group. http://www.khronos.org/developers/resources/opencl/
[GPUComp-108]	NVIDIA. (2010). OpenCL GPU computing support on NVIDIA's CUDA architecture GPUs. Santa Clara, CA: NVIDIA. http://www.nvidia.com/object/cuda_opencl.html

Centre for Development of Advanced Computing