



Prog. on Multi-Core Processors with Intel Xeon Phi : Cilk Plus

Intel Cilk Plus extends Cilk by adding array extensions, being incorporated in a commercial compiler (from Intel), and having compatibility with existing debuggers. The Cilk Plus language extends C and C++ to simplify writing parallel applications that efficiently exploit Intel Xeon Phi coprocessors. Example programs using compiler pragmas, directives, function calls, and environment variables, the compilation and execution of Cilk Plus programs, and programs for numerical and non-numerical computations are discussed.

Compilation :     Cilk Plus and Vectorization / No Vectorization

Execution :     Set up Run time Prog. Env.     Execution     Cilk Plus Script    

Offload Information : Compiler Offload Pragma & Report     Compiler Offload Clauses

Cilk Plus : Background     Overview     Using Cilk Plus Natively     Offloading Cilk Plus



Matrix - Computation Codes

Example 1 : Vector-Vector Addition using Cilk Plus Framework

Example 2 : Vector-Vector Multiplication using Cilk Plus Framework

Example 3 : Matrix - Matrix Multiply using Cilk Plus Framework

Example 4 : Pi Computation - Cilk Plus Framework

Example 5 : Pi Computation - Cilk Plus Framework Implicit Offloading

Example 6 : Pi Computation - Simultaneous comp. on Host & Accelerator using Cilk Plus

Example 7 : Poisson Solver using Cilk Plus Framework

References :     Xeon Phi Coprocessor


The key specifications of the Intel Xeon Phi Coprocessor

Feature                   Specification
Clock Frequency           1.091 GHz
No. of Cores              61
Memory Size / Type        8 GB / GDDR5
Memory Speed              5.5 GT/s
Peak DP/SP Performance    1.065 / 2.130 teraFLOP/s
Peak Memory Bandwidth     352 GB/s

Compilation of Sequential Programs : Vectorize / No-Vectorize

Using command line arguments (Vectorization & No Vectorization )

The compilation and execution of a program for an Intel Many Integrated Core (MIC) architecture coprocessor (-mmic), also known as the Intel Xeon Phi coprocessor, are given below.

Compilation :

To compile the program : Using Intel C Compiler with Vectorization

# icpc  -mmic  -vec-report3  -O3  <program name>  -o  <name of executable>

For example, to compile a simple seq-matrix-matrix-multiply.c program, the user can type on the command line

# icpc  -mmic  -vec-report3  -O3  seq-matrix-matrix-multiply.c  -o  seq-matrix-matrix-multiply

To compile the program : Using Intel C Compiler without Vectorization

The user can ask the compiler not to vectorize the code with the -no-vec option and execute it; performance is likely to be lower.

# icpc  -mmic  -no-vec  -vec-report3  -O3  seq-matrix-matrix-multiply.c  -o  seq-matrix-matrix-multiply

To compile the program using Makefile Utility, Using Intel C Compiler with Vectorization

make

Note: If the Makefile has some extension like Makefile_C then user is required to type

make -f Makefile_C (instead of simply typing make)

make -f Makefile.OFFLOAD (Compile using OFFLOAD mode)

make -f Makefile.NATIVE (Compile using NATIVE mode)

make -f Makefile.OFFLOAD clean (Clean the Object files & Binaries )


The Intel compiler flag -cilk-serialize can be used to run the Intel Cilk Plus implementation sequentially (serialized).
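For example, a serialized build of a Cilk Plus program might look like the following (the source and executable names are illustrative):

# icpc  -mmic  -cilk-serialize  -O3  cilk-plus-matrix-matrix-multiply.cpp  -o  cilk-plus-matrix-matrix-multiply-serial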

The details of the syntax of the commands to compile programs for the Intel Xeon Phi are given above.


Cilk Plus Set Up Run time Prog. Env.

Execution of Programs : Sequential & Cilk Plus

To execute the application on the coprocessor : Log in to the Xeon Phi Coprocessor

To execute the program on the coprocessor, the user logs in to the coprocessor and then simply types the name of the executable on the command line.

./<name of executable>

For example, to execute the compiled seq-matrix-matrix-multiply application, the user types the command

./seq-matrix-matrix-multiply 

For example, to execute the cilk-plus-matrix-matrix-multiply Cilk Plus application, the user types

./cilk-plus-matrix-matrix-multiply  

The expected output:

Initializing the Vectors
Computation started
gigaFLOPs = ****
Time = ****
gigaFLOPs per Sec = *****


Execution - Script

Script to run on Xeon Phi in Native Mode :
export LD_LIBRARY_PATH=/opt/intel/mic/lib64/:/opt/intel/lib/mic:/opt/intel/mkl/lib/mic/:${LD_LIBRARY_PATH}
./run <size> <number of workers>
unset LD_LIBRARY_PATH


Compiler Offload Pragma & Report


Details of Code : Intel compiler's offload pragmas :

On the Xeon host, the code to transfer data to the Xeon Phi coprocessor is automatically created by the Intel compiler. When the Intel compiler encounters an offload pragma in C/C++ or Fortran code (which may also contain OpenMP pragmas), it generates code for both the coprocessor and the host. The programmer's responsibility is to include appropriate offload pragmas with suitable data clauses. Details can be found under "Offload Using a Pragma" in the Intel compiler documentation given in the references.

  • Using #pragma offload target(mic) : In this example, how to offload the matrix computation to the Intel Xeon Phi coprocessor using #pragma offload target(mic) is shown.

  • Choose the target MIC out of multiple coprocessors : The user can also specify the Intel Xeon Phi coprocessor Number_Id in a system with multiple coprocessors (e.g., PARAM YUVA compute nodes) by using #pragma offload target(mic:Number_Id).

Other Information about Intel compiler's offload :
  • Use -no-offload : Offloading is enabled by default for the Intel compiler. Use -no-offload to disable the generation of offload code.

  • Vec Report : Using the compiler option -vec-report2, one can see which loops have been vectorized on the host and on the MIC coprocessor.

  • Printing Data transfer (OFFLOAD_REPORT) : By setting the environment variable OFFLOAD_REPORT one can obtain information about performance and data transfers at runtime:

    hypack-01: ~offload_c> export OFFLOAD_REPORT=2
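As a minimal, hedged sketch of an offload region (the loop and variable names are illustrative; lexically visible scalars such as sum are transferred automatically by the compiler):

#include <cstdio>

int main()
{
    double sum = 0.0;

    // The block below executes on the coprocessor when one is present;
    // otherwise the offload runtime falls back to the host.
    #pragma offload target(mic)
    {
        for (int i = 0; i < 1000; i++)
            sum += (double)i;
    }

    std::printf("sum = %f\n", sum);
    return 0;
}

Compiling this with icpc (offload is enabled by default) and setting OFFLOAD_REPORT=2 before running it should print the data-transfer report described above.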


Intel Xeon Phi Coprocessor Compiler Offload Clauses
  • Using #pragma offload target(mic) : In all the examples given below, the important information related to offloading the matrix computations to the Intel Xeon Phi coprocessor using #pragma offload target(mic) is discussed.

The Intel Xeon Phi coprocessor programming environment provides an "offload pragma" that supplies additional annotation so the compiler can correctly move data to and from the external Xeon Phi card. Note that single or multiple OpenMP loops can be contained within the scope of the offload directive. The clauses are interpreted as follows:

Offload: The offload pragma keyword specifies different clauses that contain information relevant to offloading to the target device.

target(mic:MIC_DEV) is the target clause that tells the compiler to generate code for both the host processor and the specified offload device, i.e., the Xeon Phi coprocessor. The constant parameter MIC_DEV is an integer associated with a Xeon Phi device. Note that the offload runtime behaves differently depending on the situation:

  • The offload runtime will schedule offload work within a single application in a round-robin fashion, which can be useful to share the workload amongst multiple devices.

  • The offload runtime will utilize the host processor when no coprocessors are present and no device number is specified (for example, target(mic)).

  • Programmers can use _Offload_to to specify a device in their code.

  • It is the responsibility of the programmer to ensure that any persistent data resides on all the devices. During round-robin scheduling, having the persistent data resident on all the devices is important from a performance point of view and to avoid PCIe bottlenecks. In general, only use persistent data when the device number is specified.

  • Choose the target MIC out of multiple coprocessors : The user can also specify the Intel Xeon Phi coprocessor MIC_DEV in a system with multiple coprocessors (e.g., PARAM YUVA compute nodes) by using #pragma offload target(mic:MIC_DEV).

  • Using #pragma offload target(mic) : To offload the matrix computation to the Intel Xeon Phi coprocessor using #pragma offload target(mic), the following clauses are required.

in(Matrix_A:length(size*size)): The in(var-list : modifiers) clause explicitly copies data from the host to the coprocessor. Note that:

  • The length(element-count-expr) specifies the number of elements to be transferred. The compiler will perform the conversion to bytes based on the type of the elements.

  • By default, memory will be allocated on the device and deallocated on exiting the scope of the directive.

  • The alloc_if(condition) and free_if(condition) modifiers can change the default behavior.

out(Matrix_A:length(size*size)): The out(var-list : modifiers) clause explicitly copies data from the coprocessor to the host. Note that:

  • The length(element-count-expr) specifies the number of elements to be transferred. The compiler will perform the conversion to bytes based on the type of the elements. By default, memory will be deallocated on exiting the scope of the directive.

  • The free_if(condition) modifier can change the default behavior.
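A hedged sketch of these clauses for a flat matrix multiply (the function, the identifiers and the explicit device number are illustrative, not the course's reference implementation):

// Sketch: multiply two size x size matrices stored as flat arrays,
// offloading the triple loop to coprocessor 0. The in/out/length clauses
// control the data transfers.
void offload_matmul(double *Matrix_A, double *Matrix_B,
                    double *Matrix_C, int size)
{
    #pragma offload target(mic:0) \
            in(Matrix_A : length(size*size)) \
            in(Matrix_B : length(size*size)) \
            out(Matrix_C : length(size*size))
    {
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++)
                Matrix_C[i*size + j] = 0.0;          // initialise the result on the device

        for (int i = 0; i < size; i++)
            for (int k = 0; k < size; k++)
                for (int j = 0; j < size; j++)
                    Matrix_C[i*size + j] += Matrix_A[i*size + k] * Matrix_B[k*size + j];
    }
}

Because Matrix_C is written but never read on the device before being initialised, out() with length() is sufficient here; if the result had to be accumulated across several offloads, inout() together with alloc_if()/free_if() would be the appropriate combination.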


Tuning & Performance : Memory Alignment for Vectorization
  • Memory Alignment for Vectorization : The matrices are dynamically allocated using posix_memalign(), and their sizes must be specified via the length() clause. Using in, out and inout one can specify which data has to be copied to the Intel Xeon Phi coprocessor from the host and back.

  • Data alignment - 64-Byte : It is recommended that data for the Intel Xeon Phi is 64-byte (512-bit) aligned, as required by the MIC architecture.

  • Alignment using #pragma vector aligned : For proper alignment of data to get performance from Intel compiler vectorization, #pragma vector aligned is used. This tells the compiler that all array data accessed in the loop is properly aligned.

  • In addition, the -std=c99 command-line option tells the compiler to allow use of the restrict keyword and C99 VLAs. Note that the C99 restrict keyword specifies that the vectors do not overlap. (Compiling with -std=c99 is required for efficient vectorization.)

  • Data should be aligned to 16 bytes (128 bits) for SSE, 32 bytes (256 bits) for AVX, and 64 bytes (512 bits) for the MIC architecture.
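A small sketch combining these recommendations, assuming posix_memalign() is available (as on Linux); in C, adding restrict-qualified pointers with -std=c99 would additionally give the compiler the non-overlap guarantee mentioned above:

// Sketch: 64-byte aligned allocation plus "#pragma vector aligned" so the
// compiler can assume aligned accesses when vectorizing the loop.
#include <cstdio>
#include <cstdlib>

void scale(double *a, const double *b, int n)
{
    #pragma vector aligned
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

int main()
{
    const int n = 1024;
    double *a, *b;
    posix_memalign((void **)&a, 64, n * sizeof(double));   // 64-byte alignment for MIC
    posix_memalign((void **)&b, 64, n * sizeof(double));
    for (int i = 0; i < n; i++) b[i] = i;
    scale(a, b, n);
    std::printf("a[10] = %f\n", a[10]);
    std::free(a); std::free(b);
    return 0;
}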


Cilk Plus : Introduction & Background
  • The Cilk language has been developed since 1994 at the MIT Laboratory for Computer Science. It is based on ANSI C, with the addition of just a handful of Cilk-specific keywords, and is a faithful extension of C. Cilk started as a runtime technology for algorithmic multi-threaded programming developed at MIT; MIT licensed the Cilk technology to Cilk Arts, Inc. of Lexington, MA, and it later transitioned to Intel, where former Cilk Arts developers added the technology to the Intel compilers. A commercial version of Cilk, called Cilk++, that supported C++ and was compatible with both GCC and Microsoft C++ compilers, was developed by Cilk Arts, Inc.

  • The cilk keyword morphed into extern "Cilk", and Cilk++ introduced the notion of hyperobjects. Its basic approach is that a programmer should concentrate on structuring programs to expose parallelism and exploit locality, while leaving the run-time system responsible for scheduling computations to run efficiently on a given multi-core processor.

  • The Cilk++ language is particularly well suited to divide-and-conquer algorithms having independent computations, and it works on problems that involve scheduling chunks of computation. The Cilk++ strategy is to break the problem into sub-problems (tasks) that can be solved independently, then combine the results. The tasks may be implemented in separate functions or in iterations of a loop.

Intel Cilk Plus extends Cilk by adding array extensions, being incorporated in a commercial compiler (from Intel), and having compatibility with existing debuggers. The Cilk Plus language extends C++ to simplify writing parallel applications that efficiently exploit Intel Xeon Phi coprocessors. The implementation of the Intel Cilk Plus language extensions in GCC requires patches to the C and C++ front-ends, plus a copy of the Intel Cilk Plus runtime library (Cilk Plus RTL). Intel's set of C and C++ constructs for task-parallel and data-parallel programming is designed to improve performance on multicore and vector processors.

Intel Threading Building Blocks (TBB) is a powerful solution for C++ programmers to address tasking in general, along with a number of related C++ issues such as thread-aware memory allocation, thread-safe versions of key STL container classes, portable locks and atomics, and timing facilities. Cilk Plus can use components of TBB, including the scalable memory allocator and the tick_count timing facility.


Cilk Plus : Overview

Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and C++ language implemented by the Intel C++ Compiler, which is part of Intel Studio XE product suites and Intel Composer XE product bundles.

  • Cilk Plus works with C and C++ programming, provides strong guarantees on scheduler performance and stack space, and has features to assist with vectorization (use of SIMD instructions). For data-parallel applications, Cilk Plus provides vector loops, array notations, and elemental functions.

  • The Cilk Plus runtime system takes care of details like load balancing, synchronization, and communication protocols. Unique to Cilk, the runtime system guarantees efficient and predictable performance and also guarantees bounds on stack size.

Cilk Plus offers a simple, powerful expression of task parallelism. According to Intel, the three Intel Cilk Plus keywords provide a "simple yet surprisingly powerful" model for parallel programming, while runtime and template libraries offer a well-tuned environment for building parallel applications. The three Intel Cilk Plus keywords are:

_Cilk_spawn ;       _Cilk_sync ;       _Cilk_for ;      

  • _Cilk_spawn : Specifies the start of parallel execution. Specifies that a function call can execute asynchronously, without requiring the caller to wait for it to return. This is an expression of an opportunity for parallelism, not a command that mandates parallelism. The Intel Cilk Plus runtime will choose whether to run the function in parallel with its caller.

    _Cilk_spawn - Annotates a function-call and indicates that execution may (but is not required to) continue without waiting for the function to return. The syntax is:
    [<type> <retval>=] _Cilk_spawn <postfix_expression>
    (<expression-list> (optional))

  • _Cilk_sync : Specify the end of parallel execution. Specifies that all spawned calls in a function must complete before execution continues. There is an implied cilk_sync at the end of every function that contains a cilk_spawn .

    _Cilk_sync - Indicates that all the statements in the current Cilk block must finish executing before any statements after the _Cilk_sync begin executing. The syntax is:

    _Cilk_sync ;

  • _Cilk_for : Parallelizes for loops. (Allows iterations of the loop body to be executed in parallel.)

    _Cilk_for - is a variant of a for statement where any or all iterations may (but are not required to) execute in parallel. You can optionally precede _Cilk_for with a grainsize-pragma to specify the number of serial iterations desired for each chunk of the parallel loop. If there is no grainsize pragma or if the grainsize evaluates to '0', then the runtime will pick a grainsize using its own internal heuristics. The syntax:

    [ #pragma cilk grainsize = ] _Cilk_for ( ; ; )

  • Hyperobjects (Reducers) : A lock-free mechanism that allows parallel code to use private "views" of a variable, which are merged at the next sync.

  • Array Notation : Data parallelism for whole arrays or sections of arrays and operations thereon. ( Intel Cilk Plus includes a set of notations that allow users to express high-level operations on entire arrays or sections of arrays. These notations help the compiler to effectively vectorize the application. Intel Cilk Plus allows C/C++ operations to be applied to multiple array elements in parallel, and also provides a set of builtin functions that can be used to perform vectorized shifts, rotates, and reductions )

  • Elemental Functions : Enables data parallelism of whole functions or operations which can then be applied to whole or parts of arrays. ( An elemental function is a regular function which can be invoked either on scalar arguments or on array elements in parallel. They are most useful when combined with array notation or #pragma simd. )

    #pragma simd : This pragma gives the compiler permission to vectorize a loop even in cases where auto-vectorization might fail. It is the simplest way to manually apply vectorization.

  • Along with these keywords, you can use #pragma simd clauses to communicate loop information to the vectorizer so it can generate better vectorized code. The five #pragma simd clauses are: vectorlength, private, linear, reduction, and assert. The list below summarizes them; for a detailed explanation please refer to the "Intel Cilk Plus Language Specification" at http://www.cilkplus.org. A short sketch combining the three keywords with #pragma simd follows this list.

    1. #pragma simd vectorlength (n1, n2, ...): Specifies one or more preferred vector lengths that the back-end may use to vectorize the loop.

    2. #pragma simd private (var1, var2, ...): Specifies a set of variables for which each loop iteration is independent of the other iterations.

    3. #pragma simd linear (var1:stride1, var2:stride2, ...): Specifies a set of variables that increase monotonically (by the given stride) in each iteration of the loop.

    4. #pragma simd reduction (operator: var1, var2, ...): Specifies a set of variables whose values are computed by a vector reduction using the specified operator.

    5. #pragma simd assert: Directs the compiler to halt if the vectorizer is unable to vectorize the loop.
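The sketch below (referred to in the list above) combines the three keywords with a #pragma simd reduction; the array sizes and the row_sum() helper are illustrative:

// Sketch: the outer loop is task-parallel via _Cilk_for, the inner
// reduction is vectorized via #pragma simd, and a final call is spawned.
#include <cstdio>
#include <cilk/cilk.h>

#define N 1024

static double A[N][N];
static double sums[N];

static double row_sum(const double *row, int n)
{
    double s = 0.0;
    #pragma simd reduction(+:s)       // ask the compiler to vectorize the reduction
    for (int j = 0; j < n; j++)
        s += row[j];
    return s;
}

int main()
{
    _Cilk_for (int i = 0; i < N; i++) {   // iterations may run in parallel
        for (int j = 0; j < N; j++)
            A[i][j] = 1.0;
        sums[i] = row_sum(A[i], N);
    }

    double total = _Cilk_spawn row_sum(sums, N);  // may overlap with other work
    _Cilk_sync;                                   // wait for the spawned call
    std::printf("total = %f (expected %f)\n", total, (double)N * N);
    return 0;
}

Compiled with icpc, the outer loop is scheduled across the Cilk workers while the inner reduction is vectorized; the expected total is N*N.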

Intel Cilk Plus involves the compiler in optimizing and managing parallelism. Integration with the compiler infrastructure allows many existing compiler optimizations to apply to the parallel code. The compiler understands these four parts of Intel Cilk Plus as given above and is therefore able to help with compile time diagnostics, optimizations and runtime error checking.

  • In Intel Cilk Plus, the array notation allows the compiler to vectorize, that is, to utilize Intel Streaming SIMD Extensions (Intel SSE) to maximize data-parallel performance, while adding cilk_for causes the driver function of the simulation to be parallelized, maximizing use of the multiple processor cores for task-level parallelism.

  • Intel Cilk Plus gives you both parallelization of the main driver loop using cilk_for and array notation for the simulation kernel to allow vectorization. Array notation provides a way to operate on slices of arrays using a syntax the compiler understands and subsequently optimizes, vectorizes, and in some cases parallelizes. This is the basic syntax:

    [<lower bound> : <length> : <stride>]
    where <lower bound>, <length>, and <stride> are optional and have integer types. The array declarations themselves are unchanged from C and C++ array-definition syntax. (A sketch using this notation together with an elemental function follows this list.)

  • SIMD Vectorization and Elemental Functions are part of the Intel Cilk Plus feature set supported by the Intel C++ Compiler that provides ways to vectorize loops and user-defined functions. The Intel compilers provide unique capabilities to enable vectorization. The programmer may be able to help the compiler vectorize more loops through a simple programming style and by the use of compiler features designed to assist vectorization.

  • The Intel compiler was able to auto-vectorize the loops in the original application after the #pragma ivdep directive was added before each loop. Note that the functions inside the loops use transcendental operations; the Intel compiler uses the Short Vector Math Library (SVML) to vectorize in this case. As an alternative to the #pragma ivdep directive, the Intel Cilk Plus Array Notation can be used to replace the loops, which gives a very clear way to express loop vectorization.
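A hedged sketch of array notation together with an elemental function (the saxpy_elem() helper is illustrative; __attribute__((vector)) is the Linux form of the elemental-function annotation):

// Sketch: an elemental function applied to whole array sections, plus
// plain array-notation assignments and sections.
#include <cstdio>

__attribute__((vector)) float saxpy_elem(float a, float x, float y)
{
    return a * x + y;
}

int main()
{
    float x[1000], y[1000], z[1000];

    x[:] = 1.0f;                           // array notation: whole-array assignment
    y[:] = 2.0f;

    z[:] = saxpy_elem(3.0f, x[:], y[:]);   // elemental call mapped over all elements
    z[0:500] = x[0:500] + y[0:500];        // section: start 0, length 500

    std::printf("z[0] = %f, z[999] = %f\n", z[0], z[999]);
    return 0;
}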


Task Parallelism :

Task parallelism can be achieved by using the keywords _Cilk_spawn and _Cilk_sync. To parallelize the application, _Cilk_spawn takes care of task creation and the scheduling of tasks to threads for you, while _Cilk_sync indicates the end of the parallel region, at which point the tasks complete and serial execution resumes.

In typical serial computations, the two calls to the target subroutine between the _Cilk_spawn and the _Cilk_sync can be executed in parallel, depending on the resources available to the Intel Cilk Plus runtime. The cilk_stub.h header file essentially comments out the keywords so that other compilers will compile the files without any further source code changes.

  • Parallelism in a C or C++ application can be simply implemented using the Intel Cilk Plus keywords, reducer hyper-objects, array notation and elemental functions. It allows you to take full advantage of both the SIMD vector capabilities of your processor and the multiple cores, while reducing the effort needed to develop and maintain your parallel code.

  • Intel Cilk Plus parallelizes the loop via the _Cilk_for keyword, which replaces the for keyword. This solution is defined as fine-grained parallelism. Since it requires explicit for loops, it cannot use the vector syntax based on Intel Cilk Plus Array Notation. The arrays of data and results are shared among the threads, so there is a negligible increase in the memory footprint of the application when running in parallel. Furthermore, race conditions can easily be avoided since the parallel regions are confined to the loop iterations. The loop that computes the reduction has also been parallelized.

  • Intel Cilk Plus provides a special template class (cilk::reducer_opadd<>) for the reduction that also works with custom types and gives reproducible results. Consequently, the Intel Cilk Plus implementation becomes easier than in OpenMP.

  • The Intel Cilk Plus implementation based on the _Cilk_for keyword cannot accommodate coarse-grained parallelism. Therefore, a new algorithm based on the _Cilk_spawn and _Cilk_sync keywords has been implemented. The algorithm splits the events into blocks, each block being executed in parallel by a _Cilk_spawn call (a sketch of this block-splitting pattern follows this list).

  • The Cilk Plus implementations can be classified as fine-grained parallelism with vectorization based on #pragma ivdep and block splitting, and coarse-grained parallelism with block splitting, vectorization based either on #pragma ivdep or on Intel Cilk Plus Array Notation, and dynamic scheduling of the blocks.

  • All of these Cilk Plus implementations can be executed with MPI as a single executable, with command-line parameter options selecting the implementation and the block size for the matrix computation codes.
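A hedged sketch of the coarse-grained, block-splitting pattern described above; process_block(), the per-event work and the block size are illustrative:

// Sketch: the event range is split into blocks and each block is processed
// by a spawned call; _Cilk_sync waits for all blocks before the final sum.
#include <cstdio>
#include <cilk/cilk.h>

static double results[16];

void process_block(int block, int begin, int end)
{
    double s = 0.0;
    for (int i = begin; i < end; i++)
        s += (double)i;                    // stand-in for per-event work
    results[block] = s;
}

int main()
{
    const int n_events = 1600, block_size = 100;
    const int n_blocks = n_events / block_size;

    for (int b = 0; b < n_blocks; b++)
        _Cilk_spawn process_block(b, b * block_size, (b + 1) * block_size);
    _Cilk_sync;                            // wait for all spawned blocks

    double total = 0.0;
    for (int b = 0; b < n_blocks; b++) total += results[b];
    std::printf("total = %f\n", total);
    return 0;
}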


Using Cilk Plus Natively

Codes written with the TBB parallel programming model run natively on the Intel Xeon Phi coprocessor and can scale up to a large number of threads. To initialize the compiler environment variables needed to set up TBB correctly, the /opt/intel/composerxe/tbb/bin/tbbvars.csh or tbbvars.sh script with intel64 as the argument is typically called by the /opt/intel/composerxe/bin/compilervars.csh or compilervars.sh script with intel64 as the argument (e.g. source /opt/intel/composerxe/bin/compilervars.sh intel64).

A minimal C++ TBB example looks as follows:

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

using namespace tbb;

int main() {
    task_scheduler_init init;
    return 0;
}

The using directive imports the namespace tbb where all of the library's classes and functions are found. The namespace is explicit in the first mention of a component, but implicit afterwards. So with the using namespace statement present you can use the library component identifiers without having to write out the namespace prefix tbb before each of them.

The task scheduler is initialized by instantiating a task_scheduler_init object in the main function. The definition for the task_scheduler_init class is included from the corresponding header file. Actually any thread using one of the provided TBB template algorithms must have such an initialized task_scheduler_init object. The default constructor for the task_scheduler_init object informs the task scheduler that the thread is participating in task execution, and the destructor informs the scheduler that the thread no longer needs the scheduler.

With the newer versions of Intel TBB as used in a MIC environment, the task scheduler is automatically initialized, so there is no need to explicitly initialize it. In the simplest form, scalable parallelism can be achieved by parallelizing a loop of iterations that can each run independently of each other.

The parallel_for template function replaces a serial loop where it is safe to process each element concurrently. The template function tbb::parallel_for breaks the iteration space into chunks, and runs each chunk on a separate thread. The first parameter of the template function call parallel_for is a blocked_range object that describes the entire iteration space from 0 to n-1. The parallel_for divides the iteration space into subspaces for each of the over 200 hardware threads. blocked_range<T> is a template class provided by the TBB library describing a one-dimensional iteration space over type T. The parallel_for construct works just as well with other kinds of iteration spaces; the library provides blocked_range2d for two-dimensional spaces.

There is also the possibility to define your own spaces. The general constructor of the blocked_range template class is blocked_range(begin, end, grainsize). T specifies the value type, begin represents the lower bound of the half-open range interval [begin,end) representing the iteration space, end represents the excluded upper bound of this range, and grainsize is the approximate number of elements per sub-range. The default grainsize is 1.

A parallel loop construct introduces overhead cost for every chunk of work that it schedules. The MIC adapted Intel TBB library chooses chunk sizes automatically, depending upon load balancing needs. The heuristic normally works well with the default grainsize. It attempts to limit overhead cost while still providing ample opportunities for load balancing. For most use cases automatic chunking is the recommended choice. There might be situations though where controlling the chunk size more precisely might yield better performance.
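A hedged sketch of tbb::parallel_for with a body functor over a blocked_range (the array, the functor name and the iteration space are illustrative):

// Sketch: split [0, 100000) into chunks; TBB schedules each chunk on a worker.
#include <cstdio>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

static float a[100000];

// Body object: operator() processes one chunk of the iteration space.
struct Scale {
    void operator()(const tbb::blocked_range<size_t> &r) const {
        for (size_t i = r.begin(); i != r.end(); ++i)
            a[i] = 0.5f * i;
    }
};

int main()
{
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 100000), Scale());
    std::printf("a[10] = %f\n", a[10]);
    return 0;
}

Passing an explicit grainsize, e.g. blocked_range<size_t>(0, 100000, 1000), overrides the automatic chunking discussed above.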


Offloading Cilk Plus

The Intel TBB header files are not available on the Intel MIC target environment by default (the same is also true for Intel Cilk Plus). To make them available on the coprocessor the header files have to be wrapped with #pragma offload directives as demonstrated in the example below:

#pragma offload_attribute (push,target(mic))
#include " tbb/task_scheduler_init.h "
#include " tbb/parallel_for.h "
#include " tbb/blocked_range.h "
#pragma offload_attribute (pop)

Functions called from within the offloaded construct and global data required on the Intel Xeon Phi coprocessor should be annotated with the special attribute __attribute__((target(mic))).

Codes using Intel TBB with an offload should be compiled with the -tbb flag instead of -ltbb.

On the coprocessor you can then export the library path and run the application.
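Putting the pieces together, a hedged sketch of an offloaded TBB loop might look as follows (the array, the functor and the fill_array() function are illustrative; compile with the -tbb flag as noted above):

// Sketch: TBB headers and the user code that must exist on the coprocessor
// are wrapped with offload_attribute push/pop; main() offloads the call.
#pragma offload_attribute (push,target(mic))
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

#define SIZE 100000
static float a[SIZE];

struct Scale {
    void operator()(const tbb::blocked_range<int> &r) const {
        for (int i = r.begin(); i != r.end(); ++i)
            a[i] = 0.5f * i;
    }
};

void fill_array()
{
    tbb::parallel_for(tbb::blocked_range<int>(0, SIZE), Scale());
}
#pragma offload_attribute (pop)

#include <cstdio>

int main()
{
    #pragma offload target(mic) out(a)
    {
        fill_array();          // runs on the coprocessor when one is present
    }
    std::printf("a[10] = %f\n", a[10]);
    return 0;
}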


Example Programs on Vector-Vector Addition and Multiplication

Example 1 : Vector - Vector Addition on Xeon-Phi using Cilk Plus framework

Objective     Input     Description     Output
(Download source code :
vect-vect-addition-cilk-plus.c;   vect-vect-addition-cilk-plus-offload.cpp;
Makefile_cilk_plus.NATIVE
;   Makefile_cilk_plus.OFFLOAD )

env_var_setup_cilk_plus_native.sh
;   env_var_setup_cilk_plus_offload.sh;

  • Objective
  • Extract performance in gigaFLOPs for vector-vector addition and analyze the performance on the Intel Xeon Phi coprocessor.

  • Description
  • Two input vectors are filled with real data and vector-vector addition is performed using the Cilk Plus framework together with the compiler & vectorization features. It is assumed that both vectors are of the same size. This example demonstrates the use of Intel Xeon Phi programming features to obtain the maximum achievable performance.
    The key computation of the code is the following loop (a Cilk Plus sketch of this kernel is given after the example description):

    for (i = 0; i < n; i++)
    {
        Vector_C[i] = Vector_A[i] + Vector_B[i];
    }
  • Input
  • Number of threads, size of the vectors.

  • Output
  • Prints the time taken for the computation and G/flops and the number of threads.
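A hedged Cilk Plus sketch of this vector-addition kernel (referred to above); the vector length is illustrative and the array-notation form is shown as an equivalent alternative:

// Sketch: the _Cilk_for form and the array-notation form are equivalent here.
#include <cstdio>
#include <cilk/cilk.h>

#define N 1000000
static double Vector_A[N], Vector_B[N], Vector_C[N];

int main()
{
    Vector_A[:] = 1.0;                        // array-notation initialization
    Vector_B[:] = 2.0;

    _Cilk_for (int i = 0; i < N; i++)         // task-parallel loop
        Vector_C[i] = Vector_A[i] + Vector_B[i];

    // Equivalent data-parallel form:
    // Vector_C[:] = Vector_A[:] + Vector_B[:];

    std::printf("Vector_C[0] = %f\n", Vector_C[0]);
    return 0;
}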


Example 2 : Vector - Vector Multiplication on Xeon-Phi using Cilk Plus framework

(Download source code :
vect-vect-multiplication-cilk-plus-native.cpp;   vect-vect-multiplication-cilk-plus-offload.cpp;
Makefile_cilk_plus.NATIVE
; Makefile_cilk_plus.OFFLOAD )

env_var_setup_cilk_plus_native.sh; env_var_setup_cilk_plus_offload.sh;


Example Programs on Matrix-Matrix Multiplication

Example 3 : Matrix - Matrix Multiply on Xeon Host using Cilk Plus framework

(Download source code :
matrix-matrix-multiply-clik-plus-native.cpp     matrix-matrix-multiply-clik-plus-offload.cpp

Makefile_clik_plus.NATIVE   Makefile_clik_plus.OFFLOAD
execute_clik_plus_offload.sh;   execute_clik_plus_native.sh )

Objective       Input       Description       Output       Summary
  • Objective
  • Extract performance in gigaFLOPs for Matrix-Matrix Multiply and analyze the performance on the Intel Xeon host using Cilk Plus.

  • Description
  • Two input matrices are filled with real data and matrix-matrix multiply is performed using compiler & vectorization features. It is assumed that both matrices are of the same size. This example demonstrates the use of Cilk Plus programming features to obtain the maximum achievable performance. The key computation of the code is two inner loops & an outer loop, i.e.,

    // Zero the Matrix_C matrix
    for (int i = 0; i < size; ++i)
        for (int j = 0; j < size; ++j)
            Matrix_C[i][j] = 0.f;

    // Compute matrix multiplication.
    for (int i = 0; i < size; ++i)
        for (int k = 0; k < size; ++k)
            for (int j = 0; j < size; ++j)
                Matrix_C[i][j] += Matrix_A[i][k] * Matrix_B[k][j];

    In this implementation, every thread works on its own row section of the input Matrix_A and multiplies it with each column of Matrix_B, producing the corresponding rows of the resulting matrix Matrix_C. It is assumed that the size of the two input square matrices is divisible by the number of threads. For convenience, we use one core and 2 or 4 Cilk Plus workers for the Matrix-Matrix Multiply algorithm. (A Cilk Plus sketch of this kernel is given after the example description.)

  • Input
  • Number of threads , Size of the Matrices.

  • Output
  • Prints the time taken for the computation and G/flops and the number of threads.
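A hedged Cilk Plus sketch of this matrix-multiply kernel (referred to above); the matrix size is illustrative and the matrices are stored as flat arrays:

// Sketch: rows of Matrix_C are computed in parallel with _Cilk_for; the
// inner loops are left to the compiler's vectorizer.
#include <cstdio>
#include <cstdlib>
#include <cilk/cilk.h>

int main()
{
    const int size = 512;
    double *Matrix_A = (double *)std::malloc(size * size * sizeof(double));
    double *Matrix_B = (double *)std::malloc(size * size * sizeof(double));
    double *Matrix_C = (double *)std::malloc(size * size * sizeof(double));

    for (int i = 0; i < size * size; i++) {
        Matrix_A[i] = 1.0;
        Matrix_B[i] = 2.0;
    }

    _Cilk_for (int i = 0; i < size; i++) {        // rows of C computed in parallel
        for (int j = 0; j < size; j++)
            Matrix_C[i * size + j] = 0.0;
        for (int k = 0; k < size; k++)
            for (int j = 0; j < size; j++)
                Matrix_C[i * size + j] += Matrix_A[i * size + k] * Matrix_B[k * size + j];
    }

    std::printf("Matrix_C[0] = %f\n", Matrix_C[0]);   // expect 2.0 * size
    std::free(Matrix_A); std::free(Matrix_B); std::free(Matrix_C);
    return 0;
}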


    Example 4 : Pi Computation on Xeon-Phi using Cilk Plus framework

    Objective     Input     Description     Output
    (Download source code :
    pie-comp-cilk-plus-native.cpp;   pie-comp-cilk-plus-offload.cpp;
    Makefile_cilk_plus.NATIVE
    ;   Makefile_cilk_plus.OFFLOAD )

    env_var_setup_cilk_plus_NATIVE.sh;   env_var_setup_cilk_plus_offload.sh;

    • Objective
    • Write a Cilk Plus program to compute the value of pi by numerical integration of the function f(x) = 4/(1+x²) between the limits 0 and 1.

    • Description
    • There are several approaches to parallelizing a serial program. One approach is to partition the data among the processes: we partition the interval of integration [0,1] among the processes, and each process estimates the local integral over its own subinterval. The local calculations produced by the individual processes are combined to produce the final result. Each process sends its integral to process 0, which adds them and prints the result.

      To perform this integration numerically, divide the interval from 0 to 1 into n subintervals and add up the areas of the rectangles as shown in the figure (n = 5). Large values of n give more accurate approximations of pi.

      Figure : Numerical integration of the pi function

      We assume that n is the total number of subintervals, p is the number of processes, and p < n. One simple way to distribute the subintervals to the processes is to divide n by p. There are two kinds of mappings that balance the load. One is a block mapping, which partitions the elements into blocks of consecutive entries and assigns the blocks to the processes. The other is a cyclic mapping: it assigns the first element to the first process, the second element to the second, and so on. If n > p, we return to the first process and repeat the assignment for the remaining elements until all the elements are assigned. We have used a cyclic mapping to partition the interval [0,1] onto the p processes. (A Cilk Plus sketch of this computation is given after the example description.)

    • Input
    • Number of threads, Number of Intervals

    • Output
    • Prints the time taken for the computation of the pi value and the number of threads.
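    A hedged Cilk Plus sketch of this pi computation (referred to above), using _Cilk_for with a reducer_opadd for the sum; the number of subintervals is illustrative:

    // Sketch: midpoint-rule integration of 4/(1+x^2) on [0,1] with a
    // _Cilk_for loop and a reducer hyperobject for the sum.
    #include <cstdio>
    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    int main()
    {
        const long n = 10000000;             // number of subintervals
        const double h = 1.0 / (double)n;    // width of each subinterval
        cilk::reducer_opadd<double> sum(0.0);

        _Cilk_for (long i = 0; i < n; i++) {
            double x = h * (i + 0.5);        // midpoint of subinterval i
            sum += 4.0 / (1.0 + x * x);
        }

        std::printf("pi is approximately %.15f\n", h * sum.get_value());
        return 0;
    }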


    Example 5 : Pi Computation on Xeon Host / Xeon Phi Coprocessor - Implicit Offload (Assignment)

    • Programmer marks variables that need to be shared between host and card
    • The same variable can then be used in both host and coprocessor code
    • Runtime automatically maintains coherence at the beginning and end of offload statements
    • Syntax: keyword extensions based
    • Example: _Cilk_shared double Pie_Comp
    Heterogeneous Compiler Offload using Implicit Copies

    Section of memory maintained at the same virtual address on both the host and Intel MIC Architecture coprocessor

    Reserving same address range on both devices allows

    • Seamless sharing of complex pointer-containing data structures
    • Elimination of user marshaling and data management
    • Use of simple language extensions to C/C++
    When "shared" memory is synchronized :
    • Synchronization is automatically done around offloads (so memory is only synchronized on entry to, or exit from, an offload call)
    • Only modified data is transferred between the CPU and the coprocessor
    • Dynamic memory you wish to share must be allocated with the special functions:
      _Offload_shared_malloc, _Offload_shared_aligned_malloc, _Offload_shared_free, _Offload_shared_aligned_free
    (A sketch using _Cilk_shared and _Cilk_offload is given below.)
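    A hedged sketch of the implicit (virtual-shared) model described above: a _Cilk_shared variable and a _Cilk_shared function, with the call offloaded via _Cilk_offload (the function body and the interval count are illustrative):

    // Sketch: Pie_Comp lives at the same virtual address on host and card;
    // it is synchronized automatically at the offload boundaries.
    #include <cstdio>

    _Cilk_shared double Pie_Comp;

    _Cilk_shared void compute_pi(long n)
    {
        double h = 1.0 / (double)n, sum = 0.0;
        for (long i = 0; i < n; i++) {
            double x = h * (i + 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        Pie_Comp = h * sum;                  // result visible on the host after the offload
    }

    int main()
    {
        _Cilk_offload compute_pi(1000000);   // runs on the coprocessor when present
        std::printf("pi is approximately %.15f\n", Pie_Comp);
        return 0;
    }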

    Example 6 : Pi Computation - Simultaneous computation on Host and Accelerator - Cilk Plus (Assignment)

          _Cilk_shared double myworkload(double input){
              // do something useful here; 'result' is a placeholder
              double result = input;
              return result;
          }

          int main() {
              double input1 = 1.0, input2 = 2.0;   // placeholder inputs
              double result1, result2;
              result1 = _Cilk_spawn _Cilk_offload myworkload(input2);
              result2 = myworkload(input1);
              _Cilk_sync;
              return 0;
          }

    • Function is generated for both MIC and CPU
    • One thread is spawned and executes the offload code on MIC
    • The host executes the same function and waits

    Example 7 : Poisson Equation Solver on Xeon-Phi using Cilk Plus framework (Assignment)
