C-DAC,Pune : High-Perf. Comp. Frontier Technologies Exploration Group and CMSD, University of Hyderabad, Technology Workshop hyPACK (October 15-18), 2013

TBB Memory Allocators

Memory is a global resource, therefore it should be protected in multi-threaded environment. Allocating memory is not only one of the most basic programming tasks, it's also one of the most challenging to do efficiently in multithreaded programs on multiprocessor systems. Because memory allocation is such an essential requirement for programs, C++ offers several ways to plug in new memory allocators. Threading Building Blocks comes with a scalable allocator that supports the same signatures as std::allocator.

Problems in nonthreaded Memory Allocation :

When ordinary, nonthreaded allocators are used, memory allocation becomes a serious bottleneck in a multithreaded program because each thread competes for a global lock for each allocation and deallocation of memory from a single global heap.

Programs that run this way are not scalable. In fact, because of this contention, programs that make intensive use of memory allocation may actually slow down as the number of processor cores increases!

Another serious issue for concurrent programs is called false sharing. False sharing occurs when multiple threads use memory locations that are close together, even if they are not actually using the same memory locations.

Scalable Memory Allocators :

The solution to the challenges of concurrent memory allocation is to use a scalable memory allocator, either in Intel Threading Building Blocks or in another third-party solution. The Threading Building Blocks scalable memory allocator utilizes a memory management algorithm divided on a per-thread basis to minimize contention associated with allocation from a single global heap.

In the TBB scalable memory allocator, each thread uses its own memory heap for object allocation. There is no global lock, the allocation is fast and in most cases does not require any lock to be acquired. However there is no guarantee of non-blocking behavior, as from time to time the allocator needs to access the global memory pool to request a new big piece of memory. Eventually, every allocator uses system calls such as mmap or VirtualAlloc to request memory from an operating system, which also somehow protects consistency.

Threading Building Blocks offers two choices :

scalable_allocator

This template offers just scalability, but it does not completely protect against false sharing. Memory is returned to each thread from a separate pool, which helps protect against false sharing if the memory is not shared with other threads.

cache_aligned_allocator

This template offers both scalability and protection against false sharing. It addresses false sharing by making sure each allocation is done on a cache line.

mmap

Memory-mapped I/O lets us map a file on disk into a buffer in memory so that, when we fetch bytes from the buffer, the corresponding bytes of the file are read. Similarly, when we store data in the buffer, the corresponding bytes are automatically written to the file.

The mmap() function establishes a mapping between a process' address space and a file or shared memory object. The format of the call is as follows

void* mmap(void *start,size_t length,int prot,int flags,int fd,off_t offset);

The mmap() function asks to map length bytes starting at offset offset from the file (or other object) specified by the file descriptor fd into memory, preferably at address start. This latter address is a hint only, and is usually specified as 0. The actual place where the object is mapped is returned by mmap().

The parameter prot determines whether read, write, execute, or some combination of accesses are permitted to the data being mapped. The prot should be either PROT_NONE or the bitwise inclusive OR of one or more of the other flags in the following table, defined in the header <sys/mman.h>.

hyPACK-2013 Mode 1 : Intel Threading Building Blocks