




hyPACK-2013 Mode-4 : Power Management - NVML on CUDA enabled NVIDIA GPUs

NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on the company's GPUs. The CUDA programming model allows GPUs to be harnessed effectively for tasks other than graphics, delivering teraflops of computing power. For high performance computing, the model generalizes the shaders of graphics computing into general-purpose stream processing (also called thread processing).

CUDA Toolkit 5.0 :         CUDA Toolkit Lib.         CUDA NVML (Power-aware Comp.)
Compilation & Execution :        
Makefile     Command line

List of Codes using NVML

Example 1 : Matrix Matrix Multiplication using NVML (Offload)

Example 2 : Matrix Matrix Multiplication using CUBLAS -NVML

Example 3 : Measure Power Consumption for Device Query Operation on GPUs

Example 4 : Measure Power Consumption for Bandwidth on GPUs

Example 5 : Measure Power Consumption for floating point computations based on global memory with/without coalesced memory access

Example 6 : Measure Power Consumption for Solution of Poisson Eq. (PDE) Solver (Assignment)

Example 7 : Measure Power Consumption for String Search algorithm (Assignment)


Test Programs & Benchmarks using NVML :
  • Power Watt Consumption : Memory Bandwidth; Asynchronous and Overlapping Transfers with Computation;

  • Power Watt Consumption : Global and Shared Memory Implementation - Memory Intensive Benchmark

  • Power Watt Consumption : Floating Point Benchmark - Coalesced Access to Global Memory; Floating Point Benchmark - Global and Shared Memory using CUBLAS library call - DGEMM; User Developed Codes for NLA Kernels

  • Power Consumption : Stream Benchmark; Open source software - MAGMA, Apps - String Search Alg.; Poisson Equation Solver




The CUDA programming model manages threads automatically, and it differs significantly from single-threaded CPU code and, to some extent, even from parallel CPU code. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks.


CUDA Developer SDK

Visit http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html to download CUDA enabled NVIDIA programs.

Sample programs can be downloaded from http://www.nvidia.com/object/cuda_get_samples.html

CUDA Toolkit 4.0 for Applications : CUDA Multi-GPU Programming
The CUDA programming model provides two basic approaches to execute CUDA kernels on multiple GPUs (CUDA "devices") concurrently from a single host application:

  • Use one host thread per device, since any given host thread can call cudaSetDevice() at most one time.
  • Use the push/pop context functions provided by the CUDA Driver API.

For applications that require tight coupling among the CUDA devices within a system, these approaches may not be sufficient, because the devices must synchronize or communicate with each other. The CUDA Runtime now provides features with which a single host thread can easily launch work onto any device it needs. To accomplish this, a host thread can call cudaSetDevice() at any time to change the currently active device, so a host thread can now control more than one device. The CUDA Driver API (Version 4.0) also provides a way to access multiple devices from within a single host thread, namely cuCtxPushCurrent() and cuCtxPopCurrent(). For convenience, CUDA application developers can instead use a set/get context-management paradigm: cuCtxSetCurrent() and cuCtxGetCurrent() have been added to version 4.0 of the CUDA Driver API in addition to the existing cuCtxPushCurrent() and cuCtxPopCurrent() functions.

Programming a multi-GPU application is no harder than programming an application to utilize multiple cores or sockets, because CUDA is completely orthogonal to CPU thread management and message-passing APIs. The main additional tasks are selecting the correct GPU, which in most cases is a free (without a context) GPU, and identifying the compute-intensive portions of the existing multi-threaded CPU code so that they can be ported to the GPU while the inter-CPU-thread communication code is left unchanged.

In order to issue work to a GPU, a context is established between a CPU thread (or group of threads) and the GPU. Only one context can be active on a GPU at any particular instant. Similarly, a CPU thread can have one active context at a time. A context is established during the program's first call to a function that changes state (such as cudaMalloc(), etc.), so one can force the creation of a context by calling cudaFree(0). Note that a context is created on GPU 0 by default, unless another GPU is selected explicitly prior to context creation with a cudaSetDevice() call. The context is destroyed either with a cudaDeviceReset() call or when the controlling CPU process exits.
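
A minimal sketch of explicit device selection before context creation, assuming a multi-GPU system (the device index used here is only illustrative):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);
        printf("Found %d CUDA device(s)\n", deviceCount);

        /* Select GPU 1 (if present) before any state-changing call;
           otherwise the context is created on GPU 0 by default. */
        if (deviceCount > 1)
            cudaSetDevice(1);

        /* Force context creation on the selected device. */
        cudaFree(0);

        /* ... allocate memory and launch kernels on the selected device ... */

        cudaDeviceReset();   /* destroy the context before exit */
        return 0;
    }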

MPI, OpenMP, Pthreads on Host CPU (Multi-Core) & Multi-GPU : In order to issue work to p GPUs concurrently, a program can either use p CPU threads, each with its own context, or it can use one CPU thread that swaps among several contexts, or some combination thereof. CPU threads can be lightweight (pthreads, OpenMP, etc.) or heavyweight (MPI). Note that any CPU multi-threading or message-passing API or library can be used, as CPU thread management is completely orthogonal to CUDA. For example, one can add GPU processing to an existing MPI application by porting the compute-intensive portions of the code without changing the communication structure. For synchronization across computations on GPUs, the host-CPU or GPUDirect is required for communication.

Even though a GPU can execute calls from one context at a time, it can belong to multiple contexts. For example, it is possible for several CPU threads to establish separate contexts with the same GPU (though multiple CPU threads within the same process accessing the same GPU would normally share the same context by default). The GPU driver manages GPU switching between the contexts, as well as partitioning memory among the contexts (GPU memory allocated in one context cannot be accessed from another context).

In many applications, the algorithm is designed so that each CPU thread (Pthreads, OpenMP, MPI) controls a different GPU. Achieving this is straightforward if a program spawns as many lightweight threads as there are GPUs, because the GPU index can be derived from the thread ID. For example, the OpenMP thread ID can be used directly to select a GPU. An MPI rank can be used to choose a GPU reliably as long as all MPI processes are launched on a single host node that has the GPU devices and the CUDA programming environment configured.
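
A minimal sketch of the thread-ID-to-GPU mapping described above, using OpenMP on the host; the kernel and buffer size are placeholders, and the code assumes compilation with nvcc -Xcompiler -fopenmp:

    #include <cuda_runtime.h>
    #include <omp.h>
    #include <stdio.h>

    __global__ void dummy_kernel(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = 2.0f * i;      /* placeholder work */
    }

    int main(void)
    {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);

        /* One lightweight host thread per GPU; the thread ID selects the device. */
        #pragma omp parallel num_threads(ngpus)
        {
            int tid = omp_get_thread_num();
            cudaSetDevice(tid);

            float *d_buf;
            cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
            dummy_kernel<<<4, 256>>>(d_buf, 1024);
            cudaDeviceSynchronize();
            cudaFree(d_buf);

            printf("Host thread %d finished work on GPU %d\n", tid, tid);
        }
        return 0;
    }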

Unified Virtual Addressing and GPUDirect 2.0 :


CUDA Toolkit 4.0 simplifies programming in multi-GPU environments for NVIDIA Tesla T20-series (Fermi) GPUs running in 64-bit mode on Linux. Unified Virtual Addressing (UVA) allows the system memory and the one or more device memories in a system to share a single virtual address space. This allows the CUDA Driver to determine by inspection the physical memory space to which a particular pointer refers, which simplifies the APIs of functions such as cudaMemcpy(), since the application no longer needs to keep track of which pointers refer to which memory.

Built on top of UVA, GPUDirect v2.0 provides for direct peer-to-peer communication among the multiple devices in a system and for native MPI transfers directly from device memory.

Multi-Threaded Programming :
This has several important ramifications for multi-threaded processes, some of which are given below. For more detail, refer to CUDA Toolkit 4.0 for Applications.

  • Host threads can now share device memory allocations, streams, events, or any other per-context objects (as seen above).

  • Concurrent kernel execution on devices of compute capability 2.x is now possible across host threads, rather than just within a single host thread. Note that this requires the use of separate streams; unless streams are specified, the kernels will be executed sequentially on the device in the order they were launched (a minimal two-stream sketch follows this list). In all cases, kernel launch via the <<<>>> notation is a thread-safe operation.

  • cudaGetLastError() is per-host-thread: it returns the last error returned by an API call in that host thread, even if other host threads are concurrently accessing the same device.
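
A minimal sketch of launching two kernels in separate streams so that they may execute concurrently on a compute capability 2.x device; the kernel body and sizes are illustrative, and for simplicity both launches are issued from one host thread:

    #include <cuda_runtime.h>

    __global__ void busy_kernel(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = d[i] * 1.0001f + 0.5f;   /* placeholder work */
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_a, *d_b;
        cudaMalloc((void **)&d_a, n * sizeof(float));
        cudaMalloc((void **)&d_b, n * sizeof(float));

        /* Separate streams allow the two kernels to run concurrently on
           compute capability 2.x; in the default stream they would be
           serialized in launch order. */
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        busy_kernel<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);
        busy_kernel<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);

        cudaDeviceSynchronize();
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }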


CUDA Driver APIs :

In CUDA version 4.0, multiple host threads can set a particular context current simultaneously using either cuCtxSetCurrent() or cuCtxPushCurrent(). For more information, refer to CUDA Toolkit 4.0 for Applications. This has the same ramifications for multi-threaded processes described above: host threads can share device memory allocations, streams, events, and other per-context objects, and concurrent kernel execution across host threads is possible on devices of compute capability 2.x when separate streams are used.

CUDA TOOLKIT Libraries

  • The CUBLAS library now supports a new API that is thread-safe and allows the application to more easily take advantage of parallelism using streams, especially for functions with scalar return parameters. This new API allows CUBLAS to work cleanly with applications using the new multi-threading features of CUDA Runtime 4.0; a minimal sketch of the handle-based API is shown after this list. The legacy CUBLAS API is still supported, but it is not thread-safe and does not offer as many opportunities for parallelism with streams as the new API.

  • The CURAND library now supports double precision Sobol, scrambled Sobol, log-normal distributions, and a faster setup technique for XORWOW.

  • The CUFFT and CUBLAS library APIs now include functions that will report the library's version number.

  • The CUSPARSE library now provides a solver for triangular sparse linear systems via the cusparse*csrsv_analysis() and cusparse*csrsv_solve() API functions.

  • The Thrust template library and the NPP image processing library are now bundled with the CUDA Toolkit, with no additional download required.

  • Some API functions in the NPP library were changed to pass results via device pointer instead of via host pointer for consistency with all of the rest of the NPP API.
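
A minimal sketch of the handle-based CUBLAS API mentioned above, assuming the matrices are square, column-major, and already resident in device memory (the function name is illustrative); link with -lcublas:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    /* Compute C = alpha * A * B + beta * C with the v2 (handle-based) API. */
    void dgemm_on_stream(double *d_A, double *d_B, double *d_C,
                         int n, cudaStream_t stream)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSetStream(handle, stream);        /* attach the call to a stream */

        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);

        cublasDestroy(handle);
    }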


NVML CUDA (Power-aware Computing) :

GPU accelerated computing systems have drawn the attention of researchers because they have tremendous computational power and high memory bandwidth, and are inherently well suited for massively data-parallel computation. In addition, the power consumed from the start of an application to the end of its execution is relatively low in comparison with counterpart multi-core CPUs. While memory bandwidth and latency issues stall a CPU, a GPU may outperform a CPU in these aspects; for example, the memory bandwidth of a modern NVIDIA GPU such as the Tesla C2075 is more than 140 GB/s. NVML is a C-based interface for monitoring and managing various states within NVIDIA Tesla GPUs. NVML has several functions that can measure characteristics of GPUs, such as device power, device temperature, unit power, unit temperature, and clock frequency. Using NVML, we measure power and temperature.

The NVIDIA Management Library (NVML) ships with a high-level utility called nvidia-smi, which not only provides a way to measure power but also various other features, such as the ability to disable ECC (Error Correction Code) if it is not needed, or to monitor memory usage, among other things.

Table I. GPU operations used for measurement of Power Consumption on NVIDIA GPUs

NVML can be used to measure power while a kernel is running, but since nvidia-smi is a high-level utility its power-sampling rate is very low, and unless the kernel runs for a very long time the change in power would not be noticed. NVML offers many useful utilities not only for GPUs such as the C2075 but also for the NVIDIA Tesla C2050, where power is reported in states rather than in milliwatts. The nvmlDeviceGetPowerUsage() function in the NVML library retrieves the power usage reading for the device, in milliwatts. This is the power draw for the entire board, including GPU, memory, etc. The reading is accurate to within +/- 5 watts, with milliwatt precision, and it is only available if power management mode is supported.
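
A minimal sketch of reading the board power through NVML, assuming device index 0 and linking with -lnvidia-ml:

    #include <nvml.h>
    #include <stdio.h>

    int main(void)
    {
        nvmlDevice_t device;
        unsigned int power_mw = 0;

        nvmlInit();
        nvmlDeviceGetHandleByIndex(0, &device);

        if (nvmlDeviceGetPowerUsage(device, &power_mw) == NVML_SUCCESS)
            printf("Current board power draw : %u mW\n", power_mw);
        else
            printf("Power management mode is not supported on this device\n");

        nvmlShutdown();
        return 0;
    }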

The CPU and GPU operations listed in Table I are measured independently as subroutines, and an average value is used to estimate the power of the important GPU operations via NVML library calls. The results are validated against the total power-watt values for the appropriate test-bed. For the CUDA CARMA system and AMD GPUs, the power consumption of the various GPU operations is measured using an external power meter and other low-level benchmarks. On AMD APUs, the external power meter is used to obtain the total power consumed by the application.

On a message-passing cluster, the power consumption must be calculated on both the host and the device (an NVIDIA GPU or AMD GPU) using OpenCL and CUDA. The power consumed by the GPU device, the data transfers from host to device and device to host, the I/O operations, as well as the initial programming environment all contribute to the total power consumption of an application.

Figure 2. Typical Pthread Model for Calculation of Power Consumption on a system

Figure 2 shows the flow of job completion when two Pthreads are used. One thread executes the job on the GPU accelerator using CUDA or OpenCL, and another thread probes the external power meter and gathers the reported power values. Similarly, a thread can work on the Xeon host (or coprocessor) and record the power values for the entire system. On NVIDIA GPUs, we use NVML library calls with CUDA and OpenCL. Multiple threads can be bound to multiple accelerators and coprocessors to record the power consumed, and a master thread gathers the data and displays it on the portal. The resolution of the power meter is in watts. In Figure 2, we use the NVML library APIs to measure the real-time power consumption of BLAS kernels and a PDE solver; we analyzed the performance and real-time power consumption of DGEMM and of OpenMP, CUDA, and OpenCL implementations of the PDE solver. The nvmlDeviceGetPowerUsage() routine reports the current power with milliwatt resolution; the power reported is that for the entire board, including GPU and memory. A sketch of this two-thread model is given below.
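
An illustrative sketch of the two-thread model of Figure 2: the main thread runs the GPU work while a Pthread samples NVML power readings. The sampling interval and the flag-based hand-off are assumptions, not the hyPACK source code. Link with -lnvidia-ml -lpthread.

    #include <nvml.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile int kernel_running = 1;

    /* Monitor thread: poll the NVML power reading until the work is done. */
    static void *power_monitor(void *arg)
    {
        nvmlDevice_t device;
        unsigned int p_mw;

        nvmlInit();
        nvmlDeviceGetHandleByIndex(0, &device);
        while (kernel_running) {
            if (nvmlDeviceGetPowerUsage(device, &p_mw) == NVML_SUCCESS)
                printf("power sample : %u mW\n", p_mw);
            usleep(50 * 1000);               /* sample every 50 ms */
        }
        nvmlShutdown();
        return NULL;
    }

    int main(void)
    {
        pthread_t monitor;
        pthread_create(&monitor, NULL, power_monitor, NULL);

        /* ... launch the CUDA / CUBLAS kernel here and wait for it ... */

        kernel_running = 0;                  /* stop sampling */
        pthread_join(monitor, NULL);
        return 0;
    }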

The power-analyzer watt meter is also used to cross-check the reported power values. The Watt's Up power meter is an external measurement device that is plugged into the system and provides various measurements via a USB serial connection. The power metrics collected include average power, voltage, current, and several others; energy can be derived from the average power and the elapsed time. The results are system-wide and of low resolution, with updates only once a second. The power meter has limited on-board memory, so the reported power values for the computations performed are collected by a separate thread, which reads the data on a regular basis and then returns the overall values when an instrumented program requests them.


CUDA Compilation, Linking and Execution of Program

Compiling a CUDA program involves additional steps, partly because the program targets two different processor architectures (the GPU and a host CPU), and partly because of CUDA's hardware abstraction. It is not as straightforward as running a C compiler to convert source code into executable object code: the same source file mixes C/C++ code written for both the GPU and the CPU, and special extensions and declarations identify the GPU code. The first step is therefore to separate the source code for each target architecture.

nvcc is a compiler driver that simplifies the process of compiling CUDA code: It provides simple and familiar command line options and executes them by invoking the collection of tools that implement the different compilation stages. nvcc's basic work flow consists in separating device code from host code and compiling the device code into a binary form or cubin object. The generated host code is output either as C code that is left to be compiled using another tool or as object code directly by invoking the host compiler during the last compilation stage.

CUDA code should include the cuda.h header file. On the compilation command line, the cuda library should be specified to the linker on UNIX and Linux environments as explained below.

1. Using command line arguments to compile CUDA source code:

The compilation and execution of CUDA programs is as simple as the compilation of C language source code.

$ nvcc   -o   < executable name >   < name of source file >  

For example to compile a simple Hello World program user can give :

$ nvcc   -o   helloworld   cuda-helloworld.cu  

Executing a Program:

To execute a CUDA Program, give the name of the executable at command prompt.

$ ./< Name of the Executable >

For example, to execute a simple HelloWorld Program, user must type:

$ ./helloworld

The output must look similar to the following:

Hello World!
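
When the program uses NVML calls, the NVML library must also be given to the linker. A typical command line (the source file name here is only illustrative) is:

$ nvcc   -o   power_demo   cuda_nvml_power_demo.cu   -lnvidia-ml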

Compilation and Execution :

The compilation and execution of a program to run in offload mode is shown below.

# Compile to run CUDA enabled NVIDIA GPUs

make   -f   Makefile_CUDA_NVML

(Download Makefile Makefile_CUDA_NVML )

# Execution on the CUDA enabled NVIDIA GPU :

./run


Example. 1 : Measure Power Consumption and extract maximum achieved performance for Matrix Matrix Multiplication using CUDA enabled NVIDIA GPUs and NVML Lib. calls with Pthread Programming Env.
(Download source code :
cuda_nvml_pthreads_power_main.cu;
cuda_nvml_mat_mat_multiply_power_kernel.cu
;
cuda_nvml_measure_power.cu
;
cuda_nvml_power_kernel_functions.h;
cuda_nvml_power_kernel_define.h;
Makefile_CUDA_NVML );

Objective       Input       Description       Output      

  • Objective
  • Extract performance in Gflops for Matrix Matrix Multiplication and analyze the performance on CUDA enabled NVIDIA GPUs using NVML library calls with the Pthread programming environment.

  • Description
  • Two input matrices are filled with real data and matrix-matrix multiplication is performed on CUDA enabled GPUs. The POSIX thread (Pthread) programming model is used to measure power consumption as well as to obtain the performance of matrix-matrix multiplication using CUDA enabled NVIDIA GPUs and NVML library calls. One Pthread on the Xeon host offloads the computation of matrix-matrix multiplication to the GPU; the other thread obtains the power consumption in milliwatts, calculated using NVML power APIs at periodic intervals of time. In the implementation, the input matrices are generated on the host CPU. In the simple algorithm, the input matrix is partitioned according to a grid of thread blocks: each thread reads one row of one matrix, performs the computation with one column of the other matrix, and computes the corresponding element of the resultant matrix on the device GPU (a kernel sketch is given after this example). The resultant matrix is transferred back to the host CPU. The application developer implements the standard algorithm with an appropriate choice of thread blocks on CUDA enabled NVIDIA GPUs.

    NVML APIs such as

    nvmlInit();
    nvmlDevice_t device;
    nvmlReturn_t result;
    nvmlDeviceGetHandleByIndex(GPUDevId , &device);
    nvmlDeviceGetPowerUsage( device, &p );

    are used in this code.

  • Input
  • Number of threads, Size of the Matrices.

  • Output
  • Prints the reported Power Consumption in Milliwatts, the achieved Gigaflops, the time taken for computation of the output matrix, and CUDA enabled NVIDIA GPU information
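
An illustrative kernel sketch of the simple algorithm described in Example 1, in which one thread computes one element of the resultant matrix (names and block size are placeholders):

    #include <cuda_runtime.h>

    /* One thread computes one element of C = A * B for square, row-major
       matrices of order n. */
    __global__ void matmul_kernel(const float *A, const float *B, float *C, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
    }

    /* Launch with a 2-D grid of 16 x 16 thread blocks covering the n x n matrix:
       dim3 threads(16, 16);
       dim3 grid((n + 15) / 16, (n + 15) / 16);
       matmul_kernel<<<grid, threads>>>(d_A, d_B, d_C, n);                       */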


Example. 2 : Measure Power Consumption and extract maximum achieved performance for Matrix Matrix Multiplication using CUBLAS Lib. of CUDA enabled NVIDIA GPUs and NVML Lib. calls with Pthread Programming Env.
(Download source code :
cublas_nvml_pthreads_power_main.cu;
cublas_nvml_mat_mat_multiply_power_kernel.cu
;
cublas_nvml_measure_power.cu
;
cublas_nvml_power_kernel_functions.h;
cublas_nvml_power_kernel_define.h;
Makefile_CUBLAS_NVML );

Objective       Input       Description       Output      
  • Objective
  • Extract performance in Gflops for Matrix Matrix Multiplication using the CUBLAS library and analyze the performance on CUDA enabled NVIDIA GPUs using NVML library calls with the Pthread programming environment.

  • Description
  • Two input matrices are filled with real data and matrix-matrix multiplication is performed on CUDA enabled GPUs. The POSIX thread (Pthread) programming model is used to measure power consumption as well as to obtain the performance of matrix-matrix multiplication using CUDA enabled NVIDIA GPUs and NVML library calls. One Pthread on the Xeon host offloads the computation of matrix-matrix multiplication to the GPU; the other thread obtains the power consumption in milliwatts, calculated using NVML power APIs at periodic intervals of time. In the implementation, the input matrices are generated on the host CPU and transferred to the device, and the CUBLAS (BLAS Level-3) library call performs the computation on the device GPU. The resultant matrix is transferred back to the host CPU.

    NVML APIs such as

    nvmlInit();
    nvmlDevice_t device;
    nvmlReturn_t result;
    nvmlDeviceGetHandleByIndex(GPUDevId , &device);
    nvmlDeviceGetPowerUsage( device, &p );

    are used in this code.

  • Input
  • Number of threads, Size of the Matrices.

  • Output
  • Prints the reported Power Consumption in Milliwatts, the achieved Gigaflops, the time taken for computation of the output matrix, and CUDA enabled NVIDIA GPU information

Compilation and Execution :

The compilation and execution of a program to run in offload mode is shown below.

# Compile to run CUDA enabled NVIDIA GPUs

make   -f   Makefile_CUBLAS_NVML

(Download Makefile Makefile_CUBLAS_NVML )

# Execution on the CUDA enabled NVIDIA GPU :

./run


Example. 3 : Measure Power Consumption for Device Query Operation on GPUs
(Download source code :
cuda_dev_query_nvml_pthreads_power_main.cu;
cuda_dev_query_nvml_power_kernel.cu
;
cuda_dev_query_nvml_measure_power.cu
;
cuda_dev_query_nvml_power_kernel_functions.h;
cuda_dev_query_nvml_power_kernel_define.h;
Makefile_DeviceQuery_NVML );

Objective       Input       Description       Output      

  • Objective
  • Power Consumption for Device Query

  • Description
  • The CUDA programming paradigm consists of a host and one or more devices. The host manages the memory and execution of the devices. A CUDA program consists of host code that runs on the host and kernel code that runs on the device. The CUDA cudaDeviceProp struct has a wealth of information, as given below.

    CUDA device properties can be obtained by calling cudaGetDeviceProperties() (a minimal sketch is given after this example). A program that prints the number of CUDA devices and the name of the current CUDA device, i.e. the CUDA Device Query operation (Runtime API), gives information such as the CUDA Driver version, CUDA Runtime version, CUDA capability major & minor revision, total amount of global memory, number of multiprocessors, number of cores, total amount of constant and shared memory per block, total number of registers available per block, warp size, maximum number of threads per block, maximum sizes of each dimension of a block, maximum sizes of each dimension of a grid, maximum memory pitch, texture alignment, and clock rate.

    If a CUDA-capable device and the CUDA Driver are installed but deviceQuery reports that no CUDA-capable devices are present, ensure that the device and driver are properly installed.

    The POSIX thread programming model is used to measure power consumption for the Device Query operation on CUDA enabled NVIDIA GPUs using NVML library calls. One Pthread on the Xeon host offloads the device query operation to the GPU; the other thread obtains the power consumption in milliwatts, calculated using NVML power APIs at periodic intervals of time. In the implementation, NVML APIs such as

    nvmlInit();
    nvmlDevice_t device;
    nvmlReturn_t result;
    nvmlDeviceGetHandleByIndex(GPUDevId , &device);
    nvmlDeviceGetPowerUsage( device, &p );

    are used in this code.

  • Input
  • None

  • Output
  • Prints the reported Power Consumption in Milliwatts, the time taken for the device query operation, and CUDA enabled NVIDIA GPU information
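
A minimal device-query sketch using cudaGetDeviceProperties(), printing a few of the fields listed above (the selection of fields is illustrative):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        printf("Number of CUDA devices : %d\n", count);

        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d : %s\n", dev, prop.name);
            printf("  Compute capability      : %d.%d\n", prop.major, prop.minor);
            printf("  Multiprocessors         : %d\n", prop.multiProcessorCount);
            printf("  Global memory (MB)      : %lu\n",
                   (unsigned long)(prop.totalGlobalMem >> 20));
            printf("  Max threads per block   : %d\n", prop.maxThreadsPerBlock);
            printf("  Clock rate (kHz)        : %d\n", prop.clockRate);
        }
        return 0;
    }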


Example. 4 : Measure Power Consumption for Bandwidth on GPUs
(Download source code :
cuda_bandwidth_nvml_pthreads_power_main.cu;
cuda_bandwidth_nvml_power_kernel.cu
;
cuda_bandwidth_nvml_measure_power.cu
;
cuda_bandwidth_nvml_power_kernel_functions.h;
cuda_bandwidth_nvml_power_kernel_define.h;
Makefile_bandwidth_NVML );

Objective       Description       Output      

  • Objective
  • Power Consumption for Bandwidth Test

  • Description
  • The CUDA programming paradigm consists of a host and one or more devices. The host manages the memory and execution of the devices. The CUDA bandwidthTest program gives the Host-to-Device, Device-to-Host, and Device-to-Device bandwidth using pinned memory transfers (a minimal timing sketch is given after this example). The device name and the bandwidth numbers vary from system to system; the important items are the lines that confirm a CUDA device was found and that all necessary tests passed. If a CUDA-capable device and the CUDA Driver are installed but the test reports that no CUDA-capable devices are present, ensure that the device and driver are properly installed.

    The POSIX thread programming model is used to measure power consumption for the bandwidth test on CUDA enabled NVIDIA GPUs using NVML library calls. One Pthread on the Xeon host offloads the bandwidth test to the GPU; the other thread obtains the power consumption in milliwatts, calculated using NVML power APIs at periodic intervals of time. In the implementation, NVML APIs such as

    nvmlInit();
    nvmlDevice_t device;
    nvmlReturn_t result;
    nvmlDeviceGetHandleByIndex(GPUDevId , &device);
    nvmlDeviceGetPowerUsage( device, &p );

    are used in this code.

  • Input
  • None

  • Output
  • Prints the reported Power Consumption in Milliwatts, the measured bandwidth, the time taken for the transfers, and CUDA enabled NVIDIA GPU information
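
A minimal sketch of timing a pinned-memory host-to-device transfer with CUDA events, from which bandwidth is derived (the transfer size is illustrative):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t bytes = 64 << 20;               /* 64 MB transfer */
        float *h_pinned, *d_buf;
        cudaMallocHost((void **)&h_pinned, bytes);   /* pinned host memory */
        cudaMalloc((void **)&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Host to Device bandwidth : %.2f GB/s\n",
               (bytes / (ms * 1.0e-3)) / 1.0e9);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_buf);
        cudaFreeHost(h_pinned);
        return 0;
    }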


Example. 5 : Measure Power Consumption for global memory access of floating point computations.
(Download source code :
cuda_globalmemory_nvml_pthreads_power_main.cu;
cuda_globalmemory_nvml_power_kernel.cu
;
cuda_globalmemory_nvml_measure_power.cu
;
cuda_globalmemory_nvml_power_kernel_functions.h;
cuda_globalmemory_nvml_power_kernel_define.h;
Makefile_cuda_globalmemory_NVML );

Download source code (WinRAR ZIP Archive) ;
CUDA NVML POWER Coalesced Memory (WinRAR ZIP archive)

Download source code (WinRAR ZIP Archive) ;
CUDA NVML POWER Shared Memory (WinRAR ZIP archive)

Objective       Description       Output      

  • Objective
  • Power Consumption for CUDA globalmemory Test

  • Description
  • The CUDA programming paradigm consists of a host and one or more devices. The host manages the memory and execution of the devices. CUDA uses a segmented memory architecture that allows applications to access data in global, local, shared, constant, and texture memory. On GPUs, memory operations on global memory are not only time consuming but also power consuming. When the GPU uses a single SM the power remains roughly the same, and power scales with the number of SMs used.

    Global memory is used to allocate or copy data between the host and device (GPU). Bandwidth between host and device memory is very low compared to data transfer within the GPU, therefore communication between host and device should be minimized. There is an overhead per communication, so single large transfers are better than many small transfers. Global memory is located in the main device memory, and data accesses from the SM to global memory are high latency (400-800 clock cycle) and low bandwidth (compared to on chip memory).

    The latency can be hidden to some extent if there are a large number of active threads. Access to global memory from the SM can be improved using coalescing. We use these rules to show power consumed by coalesced memory.

    Registers are associated with each SM and give the fastest access. Registers can store scalars and built-in vector types. Arrays indexed by constant values known at compile time typically reside in registers. On CUDA enabled NVIDIA GPUs, register usage should not exceed 32 K, since 32 K is the register space allocated per SM. Register spilling is very costly, as it may result in data being placed in local memory rather than registers.

    A floating point benchmark based on Taylor's theorem for a Numerical Linear Algebra (NLA) kernel is considered for the development of benchmarks.

    Shared memory, which is a software-managed cache, is on-chip memory with high bandwidth and low latency. It can be used for thread cooperation, as this memory is shared between all threads within a block. Shared memory is divided into successive equal-sized banks, i.e. 32 x 32-bit for the C2075, that can be accessed simultaneously.

    Shared memory can be as fast as the registers if bank conflicts are avoided. Multiple requests to the same bank result in serialization unless all threads read the same address.

    Coalesced Memory : Since access to global memory is via 32-, 64-, or 128-byte transactions, the benchmarks can be designed so that each thread accesses memory in a regular pattern within 128-byte segments. Coalesced memory accesses are very important for instruction throughput. Local and global variables use global memory, and accesses to global memory that do not follow such regular patterns are called non-coalesced accesses (a sketch contrasting the two patterns is given after this example). One Pthread on the Xeon host offloads the benchmark to the GPU; the other thread obtains the power consumption in milliwatts, calculated using NVML power APIs at periodic intervals of time. In the implementation, NVML APIs such as

    nvmlInit();
    nvmlDevice_t device;
    nvmlReturn_t result;
    nvmlDeviceGetHandleByIndex(GPUDevId , &device);
    nvmlDeviceGetPowerUsage(device, &p );

    are used in this code.

  • Input
  • None

  • Output
  • Prints the reported Power Consumption in Milliwatts and the performance results of the CUDA global memory benchmark.
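
An illustrative pair of kernels contrasting coalesced and non-coalesced (strided) access to global memory, as discussed in Example 5 (names and the stride value are placeholders):

    #include <cuda_runtime.h>

    /* Coalesced: consecutive threads touch consecutive 4-byte words, so a
       warp's accesses combine into a few 128-byte transactions. */
    __global__ void coalesced_copy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    /* Non-coalesced: a large stride scatters the warp's accesses across
       memory, so each thread generates its own transaction and both the
       time and the power consumed increase. */
    __global__ void strided_copy(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = (i * stride) % n;
        if (i < n) out[j] = 2.0f * in[j];
    }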


NVML CUDA (Power-aware Computing) Codes
Test Programs & Benchmarks using NVML :

  • CUDA Device Query on Single and Multi-GPU : Power Watt Consumption
    (Download source code : device-query-power-measurement.cu )

  • Power-Watt Consumption : Check for Power Consumption on each device with driver and without driver

  • Power Watt Consumption : Memory Check, & Memory Bandwidth

  • Power Watt Consumption : Asynchronous and Overlapping Transfers with Computation

  • Global and Shared Memory Implementation - Memory Intensive Benchmark

  • Floating Point Benchmark - Coalesced Access to Global Memory

  • Floating Point Benchmark - Global and Shared Memory using CUBLAS library call - DGEMM

  • Open Source Benchmark Stream Execution - Performance on each GPU (Download source code (WinRAR ZIP Archive): power-demo-gpu-work-Stream.zip )

  • SAXPY implementations in CUDA C and Thrust

  • Write your own program to measure the total power consumption and performance for different problem sizes for implementation of PDE solver using Finite Difference Method (FDM) based on MPI & CUDA framework.

  • Application Kernels : Implementation of Poisson Equation solver - CUDA Implementation

  • Application Kernels : Implementation of String Search Algorithms - CUDA Implementation

  • Write your own program for NLA kernel codes and measure the power consumption and performance (turn around time & throughput) of Benchmark.

  • Write your own program for NLA kernel codes and measure the power consumption and performance (turn around time & throughput) of Benchmark using CUBLAS Library.

References
1. NVIDIA Kepler Architecture
2. NVIDIA CUDA Toolkit 5.0 Preview Release, April 2012
3. NVIDIA Developer Zone
4. RDMA for NVIDIA GPUDirect coming in CUDA 5.0 Preview Release, April 2012
5. NVIDIA CUDA C Programming Guide, Version 4.2, April 2012
6. Dynamic Parallelism in CUDA Tesla K20 Kepler GPUs - Pre-release of NVIDIA CUDA 5.0
7. NVIDIA Developer Zone - CUDA Downloads, CUDA Toolkit 4.2
8. NVIDIA Developer Zone - GPUDirect
9. OpenACC - NVIDIA
10. Nsight, Eclipse Edition, Pre-release of CUDA 5.0, April 2012
11. NVIDIA OpenCL Programming Guide for the CUDA Architecture, Version 4.0, February 2011
12. Optimization : NVIDIA OpenCL Best Practices Guide, Version 1.0, February 2011
13. NVIDIA OpenCL JumpStart Guide - Technical Brief
14. NVIDIA CUDA C Best Practices Guide (Design Guide), Version 4.0, May 2011
15. NVIDIA CUDA C Programming Guide, Version 4.0, May 2011
16. NVIDIA GPU Computing SDK
17. Apple : Snow Leopard - OpenCL
18. The OpenCL Specification, Version 1.1, Khronos OpenCL Working Group, Aaftab Munshi (ed.), 2010
19. The OpenCL Specification, Version 1.0, Khronos OpenCL Working Group
20. Khronos Version 1.0 Introduction and Overview, June 2010
21. The OpenCL 1.1 Quick Reference Card
22. OpenCL 1.2 (PDF)
23. OpenCL 1.1 Specification (Revision 44), June 1, 2011
24. OpenCL Reference Pages
25. MATLAB
26. NVIDIA - CUDA MATLAB Acceleration
27. CUDA by Example - An Introduction to General-Purpose GPU Programming, Jason Sanders, Edward Kandrot (Foreword by Jack Dongarra), Addison-Wesley, 2011
28. Programming Massively Parallel Processors - A Hands-on Approach, David B. Kirk, Wen-mei W. Hwu, NVIDIA Corporation, Elsevier / Morgan Kaufmann Publishers, 2011
29. OpenCL Toolbox for MATLAB
30. NAG
31. OpenCL Programming Guide, Aaftab Munshi, Benedict R. Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg, Addison-Wesley / Pearson Education, 2012
32. The OpenCL 1.2 Specification, Khronos OpenCL Working Group
33. The OpenCL 1.2 Quick Reference Card, Khronos OpenCL Working Group

Centre for Development of Advanced Computing