



hyPACK-2013 Mode-2 : GPU Comp. CUDA enabled NVIDIA GPU Prog.
CUDA Asynchronous Concurrent Execution & Streams


NVIDIA's CUDA programming model manages threads automatically and differs significantly from single-threaded CPU code and, to some extent, even from parallel CPU code. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Modern GPUs can perform massively data-parallel computations and, alongside multi-threaded CPU applications, task-parallel work as well. In data parallelism, the same function is computed over many data elements; in task parallelism, two or more completely different tasks are executed in parallel.

On present GPUs, the use of task-parallel application kernels is growing, and state-of-the-art GPUs give programmers an opportunity to extract even more speed from GPU-based implementations. CUDA streams provide several ways to execute certain operations simultaneously on a single GPU or on multiple GPUs. Important topics on the CUDA C runtime and "Streams - Asynchronous Concurrent Execution" are discussed in detail below with example programs.

In all the programs, CUDA_SAFE_CALL(), which surrounds the CUDA API calls, is a utility macro provided as part of the hands-on codes. It simply detects that a call has returned an error, prints the associated error message, and exits the application with a failure code.



CUDA Runtime functions : Different types of Memory

Initialization : The CUDA runtime is initialized the first time a runtime function is called; during initialization, the runtime creates a CUDA context for each device in the system. This context is the primary context for that device and is shared among all the host threads of the application. The runtime does not expose the primary context to the application. When a host thread calls cudaDeviceReset(), this destroys the primary context of the device the host thread currently operates on. A host thread can set the device it operates on at any time by calling cudaSetDevice().
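As a minimal sketch of this sequence (the device index 0 and the printed message are chosen purely for illustration), a host thread might select a device, do its work, and then tear the primary context down:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main( void )
    {
        int deviceCount = 0;
        cudaGetDeviceCount( &deviceCount );   /* number of CUDA-capable devices */
        printf( "Found %d CUDA device(s)\n", deviceCount );

        cudaSetDevice( 0 );                   /* this host thread now operates on device 0;
                                                 its primary context is created on first use */

        /* ... allocate device memory, launch kernels, copy results back ... */

        cudaDeviceReset();                    /* destroy the primary context of the device
                                                 this host thread operates on */
        return 0;
    }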

Device Memory : As discussed earlier, the CUDA programming model assumes a system composed of a host and a device, each with their own separate memory. Kernels can only operate out of device memory, so the runtime provides functions to allocate, deallocate, and copy device memory, as well as transfer data between host memory and device memory. CUDA threads may access data from multiple memory spaces during their execution. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages. Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats. The global, constant, and texture memory spaces are persistent across kernel launches by the same application.

Device memory can be allocated either as linear memory or as CUDA arrays. CUDA arrays are opaque memory layouts optimized for texture fetching. Linear memory is typically allocated using cudaMalloc() and freed using cudaFree(), and data transfers between host memory and device memory are typically done using cudaMemcpy(). Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(); these functions pad the allocation appropriately to meet alignment requirements, which gives the best performance when accessing row addresses or when performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions). The runtime API also provides ways of accessing variables declared in global memory space: cudaGetSymbolAddress() retrieves the address pointing to the memory allocated for such a variable, and cudaGetSymbolSize() obtains the size of the allocated memory. A short code sample follows.
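The sketch below illustrates these calls; the variable names (constData, devData, devPointer) and the sizes are only illustrative:

    /* Accessing variables declared in global/constant memory space via the runtime API */
    __constant__ float constData[256];
    __device__   float devData;
    __device__   float *devPointer;

    void symbol_access_example( void )
    {
        float  data[256] = { 0 };
        float  value     = 3.14f;
        float *ptr       = NULL;

        /* copy to / from a __constant__ array */
        cudaMemcpyToSymbol( constData, data, sizeof( data ) );
        cudaMemcpyFromSymbol( data, constData, sizeof( data ) );

        /* copy a scalar into a __device__ variable */
        cudaMemcpyToSymbol( devData, &value, sizeof( float ) );

        /* store a device pointer in a __device__ variable */
        cudaMalloc( (void**)&ptr, 256 * sizeof( float ) );
        cudaMemcpyToSymbol( devPointer, &ptr, sizeof( ptr ) );

        /* query the address and size of the memory allocated for a symbol */
        void  *symAddr = NULL;
        size_t symSize = 0;
        cudaGetSymbolAddress( &symAddr, constData );
        cudaGetSymbolSize( &symSize, constData );
    }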

Shared memory : Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight.

Shared memory is allocated using the __shared__ qualifier. Shared memory is expected to be much faster than global memory. Any opportunity to replace global memory accesses with shared memory accesses should therefore be exploited, as illustrated by the matrix multiplication example in this workshop.

The code samples in the hands-on session include an implementation of matrix multiplication that takes advantage of shared memory.

  • In this implementation, each thread block is responsible for computing one square sub-matrix of the product, and each thread within the block is responsible for computing one element of that sub-matrix.

  • In order to fit into the device's resources, the two rectangular input matrices are divided into as many square sub-matrices of dimension block_size as necessary, and each output block is computed as the sum of the products of these square sub-matrices.

  • Each of these products is performed by first loading the two corresponding square matrices from global memory to shared memory with one thread loading one element of each matrix, and then by having each thread compute one element of the product.

  • Each thread accumulates the result of each of these products into a register and, once done, writes the result to global memory. By blocking the computation this way, we take advantage of fast shared memory and save a lot of global memory bandwidth, since each sub-matrix of A and B is read from global memory only once per output block that uses it rather than once per output element. __device__ functions are used to get and set elements and to build any sub-matrix from a matrix (a kernel sketch is given below).
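A minimal sketch of such a kernel is given below, assuming square N x N matrices with N a multiple of BLOCK_SIZE and a launch configuration of dim3(BLOCK_SIZE, BLOCK_SIZE) threads per block and (N/BLOCK_SIZE, N/BLOCK_SIZE) blocks; the names are illustrative and boundary handling is omitted:

    #define BLOCK_SIZE 16

    /* C = A * B for square N x N matrices, N assumed to be a multiple of BLOCK_SIZE */
    __global__ void matMulShared( const float *A, const float *B, float *C, int N )
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;   /* row of C this thread computes    */
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;   /* column of C this thread computes */
        float sum = 0.0f;                                  /* accumulated in a register        */

        for (int m = 0; m < N / BLOCK_SIZE; ++m) {
            /* each thread loads one element of the A and B sub-matrices */
            As[threadIdx.y][threadIdx.x] = A[row * N + m * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(m * BLOCK_SIZE + threadIdx.y) * N + col];
            __syncthreads();                               /* wait until both tiles are loaded */

            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                               /* wait before overwriting the tiles */
        }
        C[row * N + col] = sum;                            /* single write to global memory    */
    }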



Page-Locked Host Memory :

On an NVIDIA GPU, device memory is allocated with cudaMalloc(). The CUDA runtime also offers its own mechanism for allocating host memory, cudaHostAlloc(). On the host, the C library routine malloc() can of course be used to allocate memory as well.

It is important to note that there is a significant difference between the memory that malloc() allocates and the memory that cudaHostAlloc() allocates. The C library function malloc() allocates standard, pageable host memory, while cudaHostAlloc() allocates a buffer of page-locked host memory, sometimes called pinned memory. Page-locked buffers have an important property: the operating system guarantees their residency in physical memory and will never page them out to disk. Because the buffer can be neither evicted nor relocated, the OS can safely give an application access to the physical address of this memory.

The CUDA runtime provides functions to allow the use of page-locked (also known as pinned) host memory (as opposed to regular pageable host memory allocated by malloc()): cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory; cudaHostRegister() page-locks a range of memory allocated by malloc().
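As a minimal sketch of the second route (assuming CUDA 4.0 or later, where cudaHostRegister() is available; the buffer size is illustrative), an existing malloc() buffer can be page-locked and later released as follows:

    #include <stdlib.h>
    #include <cuda_runtime.h>

    void pin_existing_buffer( size_t nbytes )
    {
        /* allocate ordinary pageable memory with malloc() ... */
        void *buf = malloc( nbytes );

        /* ... then page-lock (pin) that range so the GPU can DMA to/from it directly */
        cudaHostRegister( buf, nbytes, cudaHostRegisterDefault );

        /* ... use buf as a source/destination of cudaMemcpy()/cudaMemcpyAsync() ... */

        cudaHostUnregister( buf );   /* undo the page-locking */
        free( buf );                 /* release the pageable allocation */
    }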

Knowing the physical address of a buffer, the GPU can use direct memory access (DMA) to copy data to or from the host. DMA copies proceed without intervention from the CPU; with pageable memory, this would also mean that the CPU could be simultaneously paging those buffers out to disk or relocating their physical addresses by updating the OS's page tables. When performing a memory copy with pageable memory, the CUDA driver therefore still uses DMA, but it must first stage the data through an internal page-locked buffer.

It is important to observe that the speed of memory copies from pageable memory is bounded by the lower of the PCI Express transfer speed and the speed of the system's front-side bus. The benchmarks below focus on cudaMemcpy() performance with both pageable and page-locked memory. Note that pageable buffers would still incur the overhead of an additional CPU-managed copy even if the PCI Express and front-side-bus speeds were identical. Page-locked buffers should be freed as soon as they are no longer needed as a source or destination in calls to cudaMemcpy(), rather than waiting until the application exits. Using page-locked host memory has several benefits:

  • Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices.

  • On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory

  • On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked and even higher if in addition it is allocated as write-combining.

  • Page-locked host memory is a scarce resource, however, so allocations in page-locked memory will start failing long before allocations in pageable memory. In addition, by reducing the amount of physical memory available to the operating system for paging, consuming too much page-locked memory reduces overall system performance.

Portable Memory :

A block of page-locked memory can be used in conjunction with any device in the system but by default, the benefits of using page-locked memory described above are only available in conjunction with the device that was current when the block was allocated (and with all devices sharing the same unified address space, if any). To make these advantages available to all devices, the block needs to be allocated by passing the flag cudaHostAllocPortable to cudaHostAlloc() or page-locked by passing the flag cudaHostRegisterPortable to cudaHostRegister().

The CUDA runtime supports write-combining memory, which frees up the host's L1 and L2 cache resources, making more cache available to the rest of the application. In addition, write-combining memory is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%. The CUDA runtime also supports mapped memory, which allows host memory to be accessed directly from within a kernel. This has several advantages: data transfers are implicitly performed as needed by the kernel, and data transfers can be overlapped with kernel execution without using streams. Since mapped page-locked memory is shared between host and device, however, the application must synchronize memory accesses using streams or events to avoid any potential read-after-write, write-after-read, or write-after-write hazards.
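The sketch below shows the mapped (zero-copy) case, assuming the device reports canMapHostMemory; the kernel scale() and the array size are illustrative:

    #include <cuda_runtime.h>

    __global__ void scale( float *data, int n )
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            data[i] *= 2.0f;                      /* kernel reads/writes host memory directly */
    }

    void mapped_memory_example( int n )
    {
        float *host_ptr = NULL, *dev_ptr = NULL;

        cudaSetDeviceFlags( cudaDeviceMapHost );  /* must be set before the context is created */
        cudaHostAlloc( (void**)&host_ptr, n * sizeof( float ), cudaHostAllocMapped );
        cudaHostGetDevicePointer( (void**)&dev_ptr, host_ptr, 0 );

        scale<<< (n + 255) / 256, 256 >>>( dev_ptr, n );   /* no explicit cudaMemcpy() needed */
        cudaDeviceSynchronize();                  /* ensure the kernel has finished before the
                                                     host reads host_ptr */
        cudaFreeHost( host_ptr );
    }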

Using Pageable Host Memory : Example Program - cuda_malloc_test

The CUDA parallel programs below use "pageable host" and "page-locked host (pinned)" memory: each allocates a GPU buffer and a host buffer of matching sizes and then executes some number of copies between these two buffers. We perform "host to device" and "device to host" operations and use CUDA events to mark the start and stop of the sequence of copies. The copy operation is performed 100 times so that accurate timings can be calculated. The program calls the function cuda_malloc_test(), shown below. The description of the program is as follows:


    (Download source code : cuda-pageble-memory.cu )
Using Pageable Host Memory : Example Program - cuda_malloc_test

#include <stdio.h> 
#include <cudaSafeCall.h> 


#define SIZE (64*1024*1024)

    float cuda_malloc_test( int size, bool up ) {

      cudaEvent_t     start, stop;
      int             *a, *dev_a;
      float           elapsedTime;

      CUDA_SAFE_CALL( cudaEventCreate( &start ) );
      CUDA_SAFE_CALL( cudaEventCreate( &stop ) );

      a = (int*)malloc( size * sizeof( *a ) );
      CUDA_HANDLE_NULL( a );
      CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) ) );

      CUDA_SAFE_CALL( cudaEventRecord( start, 0 ) );
      for (int i = 0; i < 100; i++) {        /* the copy is repeated 100 times (see text above) */
          if (up)
              CUDA_SAFE_CALL( cudaMemcpy( dev_a, a, size * sizeof( *dev_a ),
                                          cudaMemcpyHostToDevice ) );
          else
              CUDA_SAFE_CALL( cudaMemcpy( a, dev_a, size * sizeof( *dev_a ),
                                          cudaMemcpyDeviceToHost ) );
      }
      CUDA_SAFE_CALL( cudaEventRecord( stop, 0 ) );
      CUDA_SAFE_CALL( cudaEventSynchronize( stop ) );
      CUDA_SAFE_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

      free( a );
      CUDA_SAFE_CALL( cudaFree( dev_a ) );
      CUDA_SAFE_CALL( cudaEventDestroy( start ) );
      CUDA_SAFE_CALL( cudaEventDestroy( stop ) );

      return elapsedTime;
    }



Using Page-locked Host Memory (Pinned Memory) : Allocating a Host and GPU Buffer & Performing Copies / CUDA Events - Timer

Another example code is given below that uses a page-locked buffer allocated with cudaHostAlloc().


    (Download source code : cuda-pinned-memory.cu )
Using Page-locked Memory : Example Program - cuda_malloc_test

#include <stdio.h> 
#include <cudaSafeCall.h> 


#define SIZE (64*1024*1024)

    float cuda_malloc_test( int size, bool up ) {

      cudaEvent_t     start, stop;
      int             *a, *dev_a;
      float           elapsedTime;

      CUDA_SAFE_CALL( cudaEventCreate( &start ) );
      CUDA_SAFE_CALL( cudaEventCreate( &stop ) );

      CUDA_SAFE_CALL( cudaHostAlloc( (void**)&a, size * sizeof( *a ),
                                     cudaHostAllocDefault ) );
      CUDA_SAFE_CALL( cudaMalloc( (void**)&dev_a, size * sizeof( *dev_a ) ) );

      CUDA_SAFE_CALL( cudaEventRecord( start, 0 ) );
      for (int i = 0; i < 100; i++) {        /* the copy is repeated 100 times (see text above) */
          if (up)
              CUDA_SAFE_CALL( cudaMemcpy( dev_a, a, size * sizeof( *dev_a ),
                                          cudaMemcpyHostToDevice ) );
          else
              CUDA_SAFE_CALL( cudaMemcpy( a, dev_a, size * sizeof( *dev_a ),
                                          cudaMemcpyDeviceToHost ) );
      }
      CUDA_SAFE_CALL( cudaEventRecord( stop, 0 ) );
      CUDA_SAFE_CALL( cudaEventSynchronize( stop ) );
      CUDA_SAFE_CALL( cudaEventElapsedTime( &elapsedTime, start, stop ) );

      CUDA_SAFE_CALL( cudaFreeHost( a ) );
      CUDA_SAFE_CALL( cudaFree( dev_a ) );
      CUDA_SAFE_CALL( cudaEventDestroy( start ) );
      CUDA_SAFE_CALL( cudaEventDestroy( stop ) );

      return elapsedTime;
    }




The second CUDA parallel program focuses on passing parameters to a kernel and allocating memory on a device. It performs the addition of two values by calling a kernel declared as __global__ void add().

Responsibility of the programmer : The programmer should be aware of the restrictions on the usage of device pointers, which are summarized as follows.

  • Pass pointers allocated with cudaMalloc() to functions that execute on the device

  • Use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device.

  • Pass pointers allocated with cudaMalloc() to functions that execute on the host (though the host cannot dereference them).

  • Cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host.

To free memory allocated with cudaMalloc(), call cudaFree().

Device memory can be accessed in two ways: by using device pointers from within device code, and by using calls to cudaMemcpy() from host code.

The last parameter to cudaMemcpy() specifies the direction of the copy: cudaMemcpyDeviceToHost instructs the runtime that the source pointer is a device pointer and the destination pointer is a host pointer.

cudaMemcpyHostToDevice indicates that the source data is on the host and the destination is an address on the device.

One can also specify cudaMemcpyDeviceToDevice, which indicates that both pointers are on the device.
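The following routine is a minimal sketch of the three directions (the buffer names and size are illustrative):

    /* Minimal sketch of the three copy directions */
    void copy_directions_example( void )
    {
        int  host_buf[256] = { 0 };
        int *dev_src = NULL, *dev_dst = NULL;

        cudaMalloc( (void**)&dev_src, sizeof( host_buf ) );
        cudaMalloc( (void**)&dev_dst, sizeof( host_buf ) );

        cudaMemcpy( dev_src,  host_buf, sizeof( host_buf ), cudaMemcpyHostToDevice );   /* host   -> device */
        cudaMemcpy( dev_dst,  dev_src,  sizeof( host_buf ), cudaMemcpyDeviceToDevice ); /* device -> device */
        cudaMemcpy( host_buf, dev_dst,  sizeof( host_buf ), cudaMemcpyDeviceToHost );   /* device -> host   */

        cudaFree( dev_src );
        cudaFree( dev_dst );
    }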


CUDA Asynchronous Concurrent Execution

Concurrent Execution between Host and Device :

The CUDA runtime supports asynchronous function calls, which facilitate concurrent execution between host and device. When such a function is called, control is returned to the host thread before the device has completed the requested task. These operations include:

  • Kernel launches;

  • Memory copies between two addresses within the same device's memory;

  • Memory copies from host to device of a memory block of 64 KB or less;

  • Memory copies performed by functions that are suffixed with Async;

  • Memory set function calls.

Programmers can globally disable asynchronous kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only.

Overlap of Data Transfer and Kernel Execution

Some devices of compute capability 1.1 and higher can perform copies between page-locked host memory and device memory concurrently with kernel execution. Applications may query this capability by checking the asyncEngineCount device property, which is greater than zero for devices that support it. For devices of compute capability 1.x, this capability is only supported for memory copies that do not involve CUDA arrays or 2D arrays allocated through cudaMallocPitch().

Concurrent Kernel Execution

Some devices of compute capability 2.x can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property, which is equal to 1 for devices that support it. The maximum number of kernel launches that a device can execute concurrently is sixteen. A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context. Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.

Concurrent Data Transfers

Some devices of compute capability 2.x can perform a copy from page-locked host memory to device memory concurrently with a copy from device memory to page-locked host memory. Applications may query this capability by checking the asyncEngineCount device property, which is equal to 2 for devices that support it.
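A short query routine (a sketch; the printed labels are illustrative) shows how these capabilities can be checked at run time:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main( void )
    {
        cudaDeviceProp prop;
        int dev = 0;

        cudaGetDevice( &dev );                      /* device the host thread currently operates on */
        cudaGetDeviceProperties( &prop, dev );

        printf( "deviceOverlap     : %d\n", prop.deviceOverlap );     /* copy/kernel overlap supported   */
        printf( "asyncEngineCount  : %d\n", prop.asyncEngineCount );  /* 2 = concurrent up/down copies   */
        printf( "concurrentKernels : %d\n", prop.concurrentKernels ); /* 1 = concurrent kernel execution */
        return 0;
    }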


CUDA Streams

Applications can manage concurrency through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. A CUDA stream represents a queue of GPU operations that are executed in a specific order: several operations can be added to a stream, and the order in which they are added is the order in which they will be executed. Each stream can be viewed as a task on the GPU, and there are opportunities for such tasks to execute in parallel. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g. inter-kernel communication is undefined). Some stream features are:

  • Creation and Destruction;

  • Default Stream;

  • Explicit Synchronization;

  • Implicit Synchronization;

  • Overlapping Behavior.

Using a Single Stream & Multiple CUDA Streams

Applications may employ a single CUDA stream or multiple CUDA streams. With a single stream, the computation is divided into chunks at the start of the application, and the memory copies and kernel launch for each chunk are queued asynchronously into the stream; they execute in order on the GPU while the host thread is free to continue.

With multiple streams, different streams perform CUDA operations as required by the application; for example, stream 1 can copy its input buffers to the GPU while stream 0 executes its kernel, and stream 1 can then execute its kernel while stream 0 copies its results back to the host. The performance of the application using a single stream and the description of the program are as follows:


    (Download source code - single CUDA Stream : single-cuda-stream.cu )
    (Download source code - Multiple CUDA Streams : multiple-cuda-streams.cu )

Example Program : Using Single CUDA Stream

#include <stdio.h> 
#include <stdlib.h> 
#include <time.h> 
#include <cuda.h> 

#define sizeOfArray (1024*1024)   /* number of array elements; value chosen for illustration */

/* Utility Macro : CUDA SAFE CALL */
#define CUDA_SAFE_CALL( call )                                          \
    do {                                                                \
        cudaError_t ret = (call);                                       \
        if (ret != cudaSuccess) {                                       \
            printf(" ERROR at line %d (error %d): %s\n",                \
                   __LINE__, (int)ret, cudaGetErrorString(ret));        \
            exit(-1);                                                   \
        }                                                               \
    } while (0)

  __global__ void vectvectadd( int *device_a, int *device_b, int *device_result ) {

      int tindex = threadIdx.x + blockIdx.x * blockDim.x;
      if (tindex < sizeOfArray)
          device_result[tindex] = device_a[tindex] + device_b[tindex];
  }

  /* main : single-stream vector addition, timed with CUDA events */
int main ( int argc, char **argv ) {
cudaDeviceProp prop;

int *host_a, *host_b, *host_result;
int *device_a, *device_b, *device_result;
int whichDevice;

CUDA_SAFE_CALL( cudaGetDevice( &whichDevice ) );
CUDA_SAFE_CALL( cudaGetDeviceProperties( &prop, whichDevice ) );

    if (!prop.deviceOverlap)
    {
    printf("Device cannot handle overlaps \n");
    return 0;
    }


      cudaEvent_t     start, stop;
      float               elapsedTime;

      CUDA_SAFE_CALL( cudaEventCreate( &start ) );
      CUDA_SAFE_CALL( cudaEventCreate( &stop ) );

      cudaStream_t stream;
      CUDA_SAFE_CALL(cudaStreamCreate(&stream));

     CUDA_SAFE_CALL( cudaMalloc( (void**)&device_a,
                     sizeOfArray * sizeof( *device_a ) ) );
     CUDA_SAFE_CALL( cudaMalloc( (void**)&device_b,
                     sizeOfArray * sizeof( *device_b ) ) );
     CUDA_SAFE_CALL( cudaMalloc( (void**)&device_result,
                     sizeOfArray * sizeof( *device_result ) ) );

     CUDA_SAFE_CALL( cudaHostAlloc( (void**)&host_a,
                     sizeOfArray * sizeof( *host_a ), cudaHostAllocDefault ) );
     CUDA_SAFE_CALL( cudaHostAlloc( (void**)&host_b,
                     sizeOfArray * sizeof( *host_b ), cudaHostAllocDefault ) );
     CUDA_SAFE_CALL( cudaHostAlloc( (void**)&host_result,
                     sizeOfArray * sizeof( *host_result ), cudaHostAllocDefault ) );

for(int index = 0; index < sizeOfArray; index++)
{
host_a[index] = rand()%10;
host_b[index] = rand()%10;
}

CUDA_SAFE_CALL( cudaEventRecord( start, 0 ) );

CUDA_SAFE_CALL( cudaMemcpyAsync( device_a, host_a,
            sizeOfArray * sizeof( int ),
            cudaMemcpyHostToDevice, stream ) );

CUDA_SAFE_CALL( cudaMemcpyAsync( device_b, host_b,
            sizeOfArray * sizeof( int ),
            cudaMemcpyHostToDevice, stream ) );

/* Kernel call : launched into the same stream */
vectvectadd<<< (sizeOfArray + 255) / 256, 256, 0, stream >>>( device_a,
            device_b, device_result );

/* Copy the result back : device to host */
CUDA_SAFE_CALL( cudaMemcpyAsync( host_result, device_result,
            sizeOfArray * sizeof( int ),
            cudaMemcpyDeviceToHost, stream ) );


CUDA_SAFE_CALL(cudaStreamSynchronize(stream));
CUDA_SAFE_CALL(cudaEventRecord(stop, 0));
CUDA_SAFE_CALL(cudaEventSynchronize(stop));
CUDA_SAFE_CALL(cudaEventElapsedTime(&elapsedTime, start, stop));

printf("*********** CDAC - Tech Workshop : hyPACK-2013 \n");
printf("\n Size of array : %d \n", sizeOfArray);
printf("\n Time taken: %3.1f ms \n", elapsedTime);

CUDA_SAFE_CALL(cudaFreeHost(host_a));
CUDA_SAFE_CALL(cudaFreeHost(host_b));
CUDA_SAFE_CALL(cudaFreeHost(host_result));
CUDA_SAFE_CALL(cudaFree(device_a));
CUDA_SAFE_CALL(cudaFree(device_b));
CUDA_SAFE_CALL(cudaFree(device_result));
CUDA_SAFE_CALL(cudaEventDestroy(start));
CUDA_SAFE_CALL(cudaEventDestroy(stop));
CUDA_SAFE_CALL(cudaStreamDestroy(stream));

return 0;

}
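The multiple-streams program (multiple-cuda-streams.cu) is not reproduced here; the routine below is a minimal sketch of the chunked, two-stream pattern it follows. It assumes the vectvectadd kernel and CUDA_SAFE_CALL macro defined above, pinned host buffers of FULL_SIZE ints, and per-stream device buffers of CHUNK ints each; the names, FULL_SIZE, and CHUNK are illustrative (CHUNK is kept a multiple of the 256-thread block size so no out-of-range threads are launched).

    #define FULL_SIZE (1024*1024)
    #define CHUNK     (FULL_SIZE/16)

    void two_stream_add( int *host_a, int *host_b, int *host_result,
                         int *dev_a0, int *dev_b0, int *dev_r0,
                         int *dev_a1, int *dev_b1, int *dev_r1 )
    {
        cudaStream_t stream0, stream1;
        CUDA_SAFE_CALL( cudaStreamCreate( &stream0 ) );
        CUDA_SAFE_CALL( cudaStreamCreate( &stream1 ) );

        for (int i = 0; i < FULL_SIZE; i += 2 * CHUNK) {
            /* chunk i is queued on stream0 */
            CUDA_SAFE_CALL( cudaMemcpyAsync( dev_a0, host_a + i, CHUNK * sizeof(int),
                                             cudaMemcpyHostToDevice, stream0 ) );
            CUDA_SAFE_CALL( cudaMemcpyAsync( dev_b0, host_b + i, CHUNK * sizeof(int),
                                             cudaMemcpyHostToDevice, stream0 ) );
            vectvectadd<<< CHUNK/256, 256, 0, stream0 >>>( dev_a0, dev_b0, dev_r0 );
            CUDA_SAFE_CALL( cudaMemcpyAsync( host_result + i, dev_r0, CHUNK * sizeof(int),
                                             cudaMemcpyDeviceToHost, stream0 ) );

            /* chunk i+CHUNK is queued on stream1; its copies can overlap stream0's kernel */
            CUDA_SAFE_CALL( cudaMemcpyAsync( dev_a1, host_a + i + CHUNK, CHUNK * sizeof(int),
                                             cudaMemcpyHostToDevice, stream1 ) );
            CUDA_SAFE_CALL( cudaMemcpyAsync( dev_b1, host_b + i + CHUNK, CHUNK * sizeof(int),
                                             cudaMemcpyHostToDevice, stream1 ) );
            vectvectadd<<< CHUNK/256, 256, 0, stream1 >>>( dev_a1, dev_b1, dev_r1 );
            CUDA_SAFE_CALL( cudaMemcpyAsync( host_result + i + CHUNK, dev_r1, CHUNK * sizeof(int),
                                             cudaMemcpyDeviceToHost, stream1 ) );
        }

        CUDA_SAFE_CALL( cudaStreamSynchronize( stream0 ) );
        CUDA_SAFE_CALL( cudaStreamSynchronize( stream1 ) );
        CUDA_SAFE_CALL( cudaStreamDestroy( stream0 ) );
        CUDA_SAFE_CALL( cudaStreamDestroy( stream1 ) );
    }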



Centre for Development of Advanced Computing