Programming on Multi-Core Processors Using OpenMP APIs
|
The OpenMP API is used for writing portable multi-threaded applications in Fortran, C, and C++.
The OpenMP programming model plays a key role by providing an easy method for threading applications without burdening
the programmer with the complications of creating, synchronizing, load balancing, and destroying threads.
The OpenMP model provides a platform-independent set of compiler pragmas, directives, function calls, and environment
variables that explicitly instruct the compiler how and where to exploit parallelism in the application.
Example programs using these compiler pragmas, directives, function calls, and environment variables,
the compilation and execution of OpenMP programs, and programs for numerical and non-numerical
computations are discussed.
|
Example 1.1
|
Write an OpenMP program to print a unique thread identifier
|
Example 1.2
|
Write a "Hello world" Program Using OpenMP pragmas
|
Example 1.3
|
Illustrate a program for a loop recurrence using the OpenMP PARALLEL FOR directive
|
Example 1.4
|
Write an OpenMP program to find the Sum of Natural Numbers using the OpenMP PARALLEL FOR directive
|
Example 1.5
|
Write an OpenMP program to find the Sum of Natural Numbers using the OpenMP REDUCTION clause
|
Example 1.6
|
Write an OpenMP program for Loop-carried dependence using the OpenMP PARALLEL directive
|
Example 1.7
|
Write an OpenMP program to illustrate the Data Race condition
|
Example 1.8
|
Write an OpenMP program to illustrate Managing Shared & Private Data
|
Example 1.9
|
Write an OpenMP program to illustrate Loop Scheduling & Partitioning to Optimize Performance
|
Example 1.10
|
Write an OpenMP program to illustrate Work-Sharing Sections
|
Example 1.11
|
Write an OpenMP program to illustrate the performance improvement of parallel loops.
|
(Source - References :
Books on Multi-threading and OpenMP -
[MCMTh-01], [MCMTh-02], [MCMTh-I03], [MCMTh-05], [MCMTh-09], [MCMTh-11],
[MCMTh-15], [MCMTh-21], [MCBW-44], [MCOMP-01], [MCOMP-02],
[MCOMP-04], [MCOMP-12], [MCOMP-19], [MCOMP-25])
|
Description of OpenMP Programs |
Example 1.1
OpenMP program : Print unique thread identifier
using the OpenMP PARALLEL directive.
( Download source code :
omp-unique-threadid.c
/
omp-unique-threadid.f
)
|
- Objective
Write a simple OpenMP program to print a unique thread identifier
for each thread using the OpenMP PARALLEL directive.
- Description
Each thread prints its thread identifier. In this program the OpenMP
PARALLEL directive and the OMP_GET_THREAD_NUM run-time library routine
are used. The PARALLEL
directive pair specifies that the enclosed block of the program, referred to as a
PARALLEL region, be executed in parallel by multiple threads. The PRIVATE
clause is typically used to identify variables that are used as scratch
storage in the program segment within the parallel region. It provides a list
of variables and specifies that each thread have a private copy of those
variables for the duration of the parallel region. A minimal sketch of such a
program is given after this list.
- Input
The user has to set the OMP_NUM_THREADS environment variable to the desired
number of threads, e.g. to set the number of threads to 4:
setenv OMP_NUM_THREADS 4 (if using the csh / tcsh shell)
export OMP_NUM_THREADS=4 (if using the bash / ksh shell)
- Output
Each thread prints its thread identifier.
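A minimal sketch of such a program (compile with an OpenMP-enabled compiler, e.g. cc -fopenmp; the downloadable omp-unique-threadid.c may differ in detail):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int tid;

    /* Each thread in the team executes the PARALLEL region and prints
     * its own unique identifier; tid is private to each thread. */
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("My unique thread identifier is %d\n", tid);
    }
    return 0;
}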
|
Example 1.2 : OpenMP program : Hello
World using OpenMP PARALLEL directive
(Download source code :
omp-hello-world.c
/
omp-hello-world.f
)
|
- Objective
Write a simple OpenMP program to print "Hello World"
using the OpenMP PARALLEL directive.
- Description
Each thread prints "Hello World" message. In this program
OpenMP PARALLEL directive, PRIVATE clause and OMP_GET_THREAD_NUM run-time
library routine is used. The PARALLEL
directive pair specifies that the enclosed block of program, referred to as
PARALLEL region, be executed in parallel by multiple threads. The PRIVATE
clause is typically used to identify variables that are used as scratch
storage in the program segment within the parallel region. It provides a list
of variables and specifies that each thread have a private copy of those
variables for the duration of the parallel region.
- Input
The user has to set the OMP_NUM_THREADS environment variable to the desired
number of threads, e.g. to set the number of threads to 4:
setenv OMP_NUM_THREADS 4 (if using the csh / tcsh shell)
export OMP_NUM_THREADS=4 (if using the bash / ksh shell)
- Output
Each thread prints a message "Hello World" and its identifier.
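A minimal sketch (the exact message format is illustrative; the downloadable omp-hello-world.c may differ):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int tid;

    /* The PRIVATE clause gives each thread its own copy of tid. */
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Hello World from thread %d\n", tid);

        /* let the master thread additionally report the team size */
        if (tid == 0)
            printf("Number of threads = %d\n", omp_get_num_threads());
    }
    return 0;
}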
|
Example 1.3 :
Illustrate a program for loop recurrence using OpenMP PARALLEL
FOR directive.
(Download source code :
omp-recurrence.c
)
|
-
Objective
Write a simple OpenMP program for the parallelization of a loop nest
containing a recurrence, using the OpenMP PARALLEL FOR directive:
do j = 1, N
do i = 1, N
a(i,j) = a(i,j) + a(i,j-1)
end do
end do
|
-
Description
In this program the j loop contains a recurrence, i.e. a data dependence:
each iteration writes an element of the matrix that is read by the next iteration.
This dependence is difficult to remove, so the i loop is parallelized instead,
using the OpenMP PARALLEL FOR directive. The choice of which loop
in the nest executes in parallel can have a profound impact on
performance, so the parallel loop must be chosen carefully. A sketch of the
parallelization is given after this list.
-
Input
Number of threads and Size of Matrix
-
Output
The status of execution, i.e. the result of comparing the parallel and serial
results of the loop nest containing the recurrence.
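A sketch of this parallelization in C, with 0-based indices (the array a and the size N are assumed to be set up elsewhere, as in the downloadable omp-recurrence.c):

/* The j loop carries the recurrence a[i][j] = a[i][j] + a[i][j-1]
 * and therefore stays serial; for a fixed j all i iterations are
 * independent, so the inner i loop is parallelized. */
for (j = 1; j < N; j++)
{
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++)
        a[i][j] = a[i][j] + a[i][j-1];
}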
|
Example 1.4 :
Sum of elements
of one-dimensional real array using OpenMP PARALLEL FOR directive.
(Download source code :
omp-sumof-elements.c
/
omp-sumof-elements.f
)
|
- Objective
Write an OpenMP program to find sum of elements of one-dimensional
real array A using OpenMP PARALLEL FOR directive.
- Description
Addition of all the elements of a real array A
is performed using the OpenMP PARALLEL FOR and CRITICAL directives.
Since sum is a shared variable, access to it must be synchronized amongst the thread team via the CRITICAL directive.
The CRITICAL directive specifies a region of code that must be executed by only one thread at a time. At the end, the final
result is written into the sum variable. One common arrangement is sketched after this list.
- Input
Number of threads and the size of a real array A.
- Output
Sum of the n elements of the array A and comparative results.
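One common arrangement is sketched below: each thread accumulates a private partial sum inside the work-sharing loop and then adds it to the shared sum inside a CRITICAL section (illustrative data; the downloadable omp-sumof-elements.c may organize this differently):

#include <stdio.h>
#define N 1000

int main(void)
{
    double A[N], sum = 0.0;
    int i;

    for (i = 0; i < N; i++)          /* illustrative input data */
        A[i] = 1.0;

    #pragma omp parallel
    {
        double partial = 0.0;        /* private to each thread */
        int k;

        #pragma omp for
        for (k = 0; k < N; k++)
            partial += A[k];

        /* only one thread at a time may update the shared sum */
        #pragma omp critical
        sum += partial;
    }

    printf("Sum = %f\n", sum);
    return 0;
}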
|
Example 1.5 :
Sum of elements of a one-dimensional real array using the OpenMP REDUCTION clause.
(Download source code :
omp-sumof-elements-reduction.c
/
omp-sumof-elements-reductionop.f
)
|
- Objective
Write an OpenMP program to find sum of elements of one-dimensional
real array A using OpenMP REDUCTION clause.
- Description
Addition of all the elements of a real array A
is performed using the OpenMP PARALLEL FOR directive and the
REDUCTION clause.
Reductions are a sufficiently common type of operation that OpenMP
includes a reduction data-scope clause just to handle them.
In a reduction, we repeatedly apply a binary operator to a variable
and store the result back in the variable. With REDUCTION,
a private copy of each list variable is created for each thread. At
the end of the reduction, the reduction operator is applied to all the
private copies of the shared variable, and the final result is
written to the global shared variable. In this example we have added the
clause REDUCTION (+ : Sum), which
tells the compiler that Sum is the target of a sum reduction operation.
A sketch is given after this list.
- Input
Number of threads and the size of a real array A.
- Output
Sum of the n elements of the array.
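A sketch using the REDUCTION clause (illustrative data; the downloadable program may differ):

#include <stdio.h>
#define N 1000

int main(void)
{
    double A[N], sum = 0.0;
    int i;

    for (i = 0; i < N; i++)          /* illustrative input data */
        A[i] = 1.0;

    /* reduction(+:sum) gives each thread a private copy of sum,
     * initialized to 0, and combines the copies at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += A[i];

    printf("Sum = %f\n", sum);
    return 0;
}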
|
Example 1.6 :
Write an OpenMP program for Loop-carried dependence using the OpenMP PARALLEL directive
(Download source code :
omp-loop-carried-depend.c
)
|
- Objective
Write an OpenMP program to restructure a loop-carried dependence into a loop-independent form
using the OpenMP parallel for pragma.
- Description
The objective in threading a loop is to run independent loop iterations in parallel on different threads.
The aim is to restructure the loop so that it carries no dependence
before adding OpenMP pragmas. Even when the compiler is able to parallelize the loop, the user must
still ensure the loop is functionally correct.
In a multi-core programming environment, if the code has data dependences, the compiler may or may not
ignore them in the presence of OpenMP pragmas. A data dependence exists between two statements
S(1) and S(2) that are executed sequentially when (1) the statements S(1) and S(2) both reference the same memory location L,
(2) the execution of S(1), which references L, occurs before the execution of S(2), which also references L,
and at least one of the two statements writes to L.
The example below shows a loop-carried dependence. The initial values x[0] and y[0] are predefined, and each iteration reads values written by the previous iteration. The code needs to be rewritten to eliminate this dependence before OpenMP directives can give correct results. One possible restructuring is sketched after this list.
// The loop below carries a dependence: iteration k reads values
// written in iteration k-1, so this parallel version is incorrect.
x[0] = 0;
y[0] = 1;
#pragma omp parallel for private(k)
for (k = 1; k < n; k++)
{
    x[k] = y[k-1] + 1;    // S(1)
    y[k] = x[k-1] + 2;    // S(2)
}
|
- Input
Sequential code with the loop-dependence statements and a suitable value of n.
- Output
Correct results of the code and the performance of the program on multiple cores.
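One possible restructuring is sketched below, assuming the initial values x[0] = 0 and y[0] = 1: the recurrences imply x[k] = x[k-2] + 3 and y[k] = y[k-2] + 3, so the values at the block boundary can be precomputed and the two halves of the iteration space processed independently (the downloadable omp-loop-carried-depend.c may use a different restructuring):

#include <stdio.h>
#define N 100

int main(void)
{
    int x[N], y[N], k, b;
    int mid = N / 2;                 /* boundary between the two blocks */

    x[0] = 0;
    y[0] = 1;

    /* Precompute the boundary values from the closed forms
     * x[k] = x[k-2] + 3 and y[k] = y[k-2] + 3, so that the second
     * block does not depend on the first one. */
    x[mid] = (mid % 2 == 0) ? 3 * (mid / 2) : 2 + 3 * ((mid - 1) / 2);
    y[mid] = (mid % 2 == 0) ? 1 + 3 * (mid / 2) : 2 + 3 * ((mid - 1) / 2);

    /* The outer loop over the two blocks carries no dependence. */
    #pragma omp parallel for private(k)
    for (b = 0; b < 2; b++)
    {
        int lo = (b == 0) ? 1       : mid + 1;
        int hi = (b == 0) ? mid - 1 : N - 1;
        for (k = lo; k <= hi; k++)
        {
            x[k] = y[k-1] + 1;       /* S(1) */
            y[k] = x[k-1] + 2;       /* S(2) */
        }
    }

    printf("x[%d] = %d, y[%d] = %d\n", N-1, x[N-1], N-1, y[N-1]);
    return 0;
}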
|
Example 1.7 :
Write an OpenMP program to illustrate the Data Race condition
(Download source code :
omp-prime-datarace-condt.c
)
|
- Objective
Write OpenMP programs to illustrate a data-race condition and to eliminate it
by adding the private clause to the parallel for pragma.
- Description
Data-race condition:
A data-race condition occurs when multiple threads attempt to update the same memory location, or variable, after threading;
one of the threads writes the variable while other threads attempt to access the same memory location, and the threads
are not using any exclusive locks to control their accesses to that memory.
When these conditions hold, the order of accesses is non-deterministic, and the computation may give
different results from one run to another, depending on that order. Some data races may be benign (for example,
when the memory access is used for a busy-wait), but many data races are either bugs or caused by bugs in the
program.
The OpenMP C/C++ and Fortran compilers do not detect data-race conditions. Different
types of bugs (synchronization and performance) may lead to undesired results. Synchronization bugs arise
from race conditions and deadlocks that cause incorrect behaviour of the code. To fix some of these bugs,
either the trace-buffer facilities provided by tools can be used, or the code can be rewritten using the privatization pragmas
of OpenMP. Depending on the behaviour of the code, it may need to be modified in different ways; the popular
methods in OpenMP are privatization or synchronization using mechanisms such as mutexes.
In the simple example below, multiple threads update the variable x, which leads to undesired or wrong results.
The distribution of the work of the given loop among the threads is done using pragma omp parallel for. All the threads
update the shared variable x concurrently, so the multi-threaded run may produce incorrect results.
The data-race condition can be eliminated for this loop by adding the private(x) clause to the parallel for pragma;
the corrected loop is sketched after this list.
The application developer can avoid these issues by keeping threading contention and race conditions in mind.
Several tools such as Performance Analyzer and Thread Checker can be used.
// A data-race condition exists for the variable x in this loop.
// The developer can eliminate it by adding a private(x) clause.
float tolerance = 250.75;
int   var = 75;
float x;
#pragma omp parallel for
for (k = 1; k < n; k++)
{
    x = sinf(k * 2.0f) * var + 1;        // every thread writes the shared x
    if (x > tolerance)
        x = fmodf(x, tolerance) + 1;
    printf("x[%d] = %f\n", k, x);
    A[k] = x * x;
}
|
- Input
Number of threads and the upper bound for finding prime numbers.
- Output
Correct results of the code and the performance of the program on multiple cores.
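The corrected loop is sketched below as a small complete program: adding private(x) gives each thread its own copy of x, which removes the race. The calls sinf and fmodf stand in for the unspecified operations of the original fragment, and the array size is illustrative.

#include <stdio.h>
#include <math.h>
#define N 100

int main(void)
{
    float A[N], x;
    float tolerance = 250.75f;
    int   var = 75, k;

    /* private(x) eliminates the data race: each thread works on its
     * own copy of x for the duration of the loop. */
    #pragma omp parallel for private(x)
    for (k = 1; k < N; k++)
    {
        x = sinf(k * 2.0f) * var + 1;
        if (x > tolerance)
            x = fmodf(x, tolerance) + 1;
        printf("x[%d] = %f\n", k, x);
        A[k] = x * x;
    }

    printf("A[%d] = %f\n", N-1, A[N-1]);
    return 0;
}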
|
Example 1.8 :
OpenMP program to illustrate Managing Shared & Private Data
(Download source code :
omp-shared-private-data.c
)
|
- Objective
Write an OpenMP program to illustrate Managing Shared & Private Data
- Description
It is important to understand which data is shared and which is private in multi-threaded programs,
not only for the correctness of the results but also from the performance point of view.
The developer should carefully examine all memory references, including the references made by
called functions.
OpenMP provides a set of clauses such as shared, private, and
default. It is the developer's responsibility to identify clearly which memory in a fragment of code
should be private to each thread and which memory should be shared among the threads. When a variable is
identified as private, a separate copy of the variable is made for each thread to
access privately. When the loop exits, these private copies become undefined. In an OpenMP parallel
region, variables are shared by default, with a few exceptions.
First, in parallel for loops, the loop index is private; in the next example, the k variable is private.
Second, variables that are local to the block of the parallel region are private. Third, any variables
listed in the private, firstprivate, lastprivate, or reduction clauses are private (a short sketch of
firstprivate and lastprivate is given after this list).
Privatization is done by making a distinct copy of each of these variables for each thread.
The simple example given below requires careful declaration of private variables in the loop. The following loop
is easy to parallelize with #pragma omp parallel for, but the variable x is shared
by all the threads under the OpenMP default sharing rule, so there is a data race on x: while
one thread is reading x, another thread might be writing to it. The execution of this loop may
therefore fail to give correct results.
// A data-race condition exists for the variable x in this loop:
// x is shared by default, so all threads read and write the same copy.
int x;                      // shared among all the threads by default
int y = 75;
#pragma omp parallel for
for (k = 0; k < n; k++)
{
    x = array[k];           // every thread writes the shared x
    array[k] = function(x, y);
    printf("array[%d] = %d\n", k, array[k]);
}
|
One way of obtaining correct results is to declare the variable x private. Another way is to declare
the variable inside the parallel construct, where it is private by default.
// The data race is eliminated by adding the private(x) clause:
// each thread gets its own copy of x for the duration of the loop.
int x;
int y = 75;
#pragma omp parallel for private(x)
for (k = 0; k < n; k++)
{
    x = array[k];
    array[k] = function(x, y);
    printf("array[%d] = %d\n", k, array[k]);
}
|
In the variant below, the variable x is declared inside the loop, i.e. inside the OpenMP
parallel region, without the static keyword. The compiler and linker statically allocate memory
for static variables, so static variables are not truly private, unlike other variables declared within a
function, which are allocated on the stack frame of the function.
// Variables declared within the parallel construct are, by definition,
// private, so declaring x inside the loop body also removes the race.
int y = 75;
#pragma omp parallel for
for (k = 0; k < n; k++)
{
    int x = array[k];       // declared within the parallel construct: private
    array[k] = function(x, y);
    printf("array[%d] = %d\n", k, array[k]);
}
|
- Input
Number of threads and the size of an array.
- Output
Correct results of the code and the performance of the program on multiple cores.
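A short sketch of the other privatization clauses mentioned above (the variables are hypothetical and not part of the downloadable program): firstprivate initializes each thread's private copy from the value the variable had before the construct, and lastprivate copies the value from the sequentially last iteration back to the shared variable.

#include <stdio.h>
#define N 16

int main(void)
{
    int array[N];
    int k;
    int start    = 100;   /* firstprivate: every thread starts from this value */
    int last_val = 0;     /* lastprivate: receives the value of iteration N-1  */

    #pragma omp parallel for firstprivate(start) lastprivate(last_val)
    for (k = 0; k < N; k++)
    {
        last_val = start + k;      /* private copy, recomputed each iteration  */
        array[k] = last_val;
    }

    printf("last_val = %d (expected %d), array[%d] = %d\n",
           last_val, start + N - 1, N-1, array[N-1]);
    return 0;
}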
|
Example 1.9 :
Write an OpenMP program to illustrate Loop Scheduling & Partitioning to Optimize Performance
(Download source code :
omp-matmat-schedule.c
)
|
- Objective
Write an OpenMP program to illustrate loop scheduling and partitioning to optimize performance.
- Description
Effective loop scheduling and partitioning of data among threads is required for good load balancing
and thereby for achieving optimal performance on multi-core processors. One of the objectives is to
ensure that the execution cores are kept busy with computation with a minimum amount of overhead from scheduling,
context switching and synchronization. For applications having a static workload,
it is possible to estimate the performance. In the case of applications with a dynamic workload,
some threads may finish their work significantly before others. This leaves cores idle and
wastes many CPU cycles. OpenMP provides four scheduling schemes that are appropriate for many
situations:
static, dynamic, runtime and
guided.
It is important to estimate the workload of the loop iterations by examining the source code.
Even though it is hard to determine the time taken by each thread to complete its work,
the code can be modified in such a way that the compiler and runtime library can better partition and
distribute the iterations of the loop across the threads, and therefore the cores, for optimal load
balancing. Passing loop-scheduling information via the schedule (kind[, chunksize]) clause can further
help the performance.
By default, an OpenMP parallel for or work-sharing for loop uses static, even
scheduling. This means the iterations of a loop are distributed among the threads in roughly equal
numbers of iterations. The static scheduling scheme minimizes the chance of memory conflicts that
can arise when more than one thread tries to access the same piece of memory. This
approach tends to use memory well, but it says nothing about optimal load balancing. Thus,
it is necessary to strike a balance between optimal memory usage and optimal load balancing by measuring
performance to see which method produces the best results.
Loop-scheduling and partitioning information is conveyed to the compiler and runtime library on the
OpenMP for construct with the schedule clause:
#pragma omp for schedule (kind[, chunksize])
The choice of the chunk-size parameter plays an important role in obtaining good
performance. An improper chunk size may increase the overhead of each thread going back to the work queue,
thereby reducing performance and possibly offsetting the benefits of load balancing. With dynamic scheduling,
the chunks are handled on a first-come, first-served basis, and the default chunk size is 1.
Each thread is given a number of iterations equal to the chunk size specified in the schedule clause;
after the thread has finished executing the iterations given to it, it requests another
set of chunk-size iterations. This continues until all of the iterations are completed. The last set of
iterations may be smaller than the chunk size.
For example, if the chunk size is specified as 16 with the schedule (dynamic, 16) clause and the total number
of iterations is 50, the partition would be 16, 16, 16, 2, with a total of four chunks.
A sketch comparing different schedule clauses is given after this list.
// The developer should choose an optimal chunk size for the system.
float x[1000], y[1000];
#pragma omp parallel for schedule(dynamic, 8)
for (k = 0; k < 1000; k++)
{
    x[k] = cos(k) * x[k] + sin(k) * y[k];
}
|
- Input
Number of threads, Chunk Size and size of Matrices
- Output
Prints the time taken for the computation using different methods and different
scheduling schemes.
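A sketch that times the same non-uniform loop under different schedule clauses using omp_get_wtime() (the loop body, sizes and chunk size are illustrative; the downloadable omp-matmat-schedule.c may structure this differently):

#include <stdio.h>
#include <math.h>
#include <omp.h>
#define N 20000

/* deliberately non-uniform amount of work per iteration */
static double work(int k)
{
    double s = 0.0;
    int j;
    for (j = 0; j < k % 2000; j++)
        s += sin((double)j);
    return s;
}

int main(void)
{
    static double x[N];
    double t0;
    int k;

    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (k = 0; k < N; k++)
        x[k] = work(k);
    printf("schedule(static)     : %f s\n", omp_get_wtime() - t0);

    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, 8)
    for (k = 0; k < N; k++)
        x[k] = work(k);
    printf("schedule(dynamic, 8) : %f s\n", omp_get_wtime() - t0);

    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(guided)
    for (k = 0; k < N; k++)
        x[k] = work(k);
    printf("schedule(guided)     : %f s\n", omp_get_wtime() - t0);

    printf("x[%d] = %f\n", N-1, x[N-1]);
    return 0;
}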
|
Example 1.10 :
Write an OpenMP program to illustrate Work-Sharing Sections - Functional Parallelism
(Download source code :
omp-workshare-section.c
)
|
- Objective
Write an OpenMP program to illustrate Work-Sharing Sections (Functional Parallelism)
- Description
OpenMP provides both data parallelism and functional parallelism: different threads can be
associated with different portions of the code. In some kinds of applications, certain portions
of the code or functions can be executed in parallel to exploit this functional parallelism.
The parallel sections pragma precedes a block of k blocks
of code that may be executed concurrently by k threads. The syntax is
#pragma omp parallel sections
Unlike loop scheduling, the schedule clause is not defined for sections;
OpenMP is in complete control of how, when,
and in what order threads are scheduled to execute the sections.
The section pragma precedes each block of code within the encompassing block preceded by
the parallel sections pragma.
The work-sharing sections pragma (construct) directs the OpenMP compiler and
runtime to distribute the identified sections of the application among the threads in the team created for the parallel region.
The example below uses a work-sharing for loop and work-sharing sections together within a single parallel region.
The calls to alpha and beta in the loop are assumed to be independent across iterations; after the loop completes, the
two sections that call telda/gamma and epsilon/delta can be evaluated concurrently.
It is possible that the program contains more sections than threads; the remaining sections get scheduled as threads
finish their previous sections. A small self-contained sketch is also given after this list.
// The developer should check that the functions used in different
// sections are independent of each other.
#pragma omp parallel
{
    #pragma omp for
    for (k = 0; k < n; k++)
    {
        x[k] = alpha(k) + beta(k);     // work-sharing loop
    }
    // the two sections below may run concurrently on different threads
    #pragma omp sections private(y, z)
    {
        #pragma omp section
        { y = telda(x);   gamma(y); }
        #pragma omp section
        { z = epsilon(x); delta(z); }
    }
}
|
- Input
Number of threads and size of matrices.
- Output
Time taken by the parallel and serial computation
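A minimal, self-contained sketch of work-sharing sections that can be compiled directly (the function names phase_a, phase_b and phase_c are illustrative and not part of the downloadable omp-workshare-section.c):

#include <stdio.h>
#include <omp.h>

static void phase_a(void) { printf("phase_a on thread %d\n", omp_get_thread_num()); }
static void phase_b(void) { printf("phase_b on thread %d\n", omp_get_thread_num()); }
static void phase_c(void) { printf("phase_c on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    /* each section may be picked up by a different thread of the team */
    #pragma omp parallel sections
    {
        #pragma omp section
        phase_a();
        #pragma omp section
        phase_b();
        #pragma omp section
        phase_c();
    }
    return 0;
}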
|
Example 1.11 :
Write an OpenMP program to illustrate the performance improvement
(Download source code :
omp-loop-invert.c
)
|
- Objective
Write an OpenMP program to illustrate the performance improvement of parallel loops.
- Description
Transforming a sequential for loop into a parallel for loop can actually increase a program's execution time,
but there are ways of improving the performance of parallel loops.
The following code segment has a data dependence:
for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
|
The above code shows that two rows may not be updated simultaneously, because there is a data dependence between rows.
However, the columns may be updated simultaneously. This means the loop indexed by j may be executed in parallel,
but not the loop indexed by i.
If the parallel for pragma is inserted before the inner loop, the resulting parallel program will execute correctly, but it may
not exhibit good performance, because it requires m-1 fork/join steps, one per iteration of the outer loop.
However, if the loops are inverted:
#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
    for (i = 1; i < m; i++)
        a[i][j] = 2 * a[i-1][j];
|
only a single fork/join step is required (surrounding the outer loop). The data dependences have not changed; the iterations
of the loop indexed by j are still independent of each other. In this respect
the code is improved. However, we must always be cognizant of how code transformations affect the cache hit rate.
A sketch that times both variants is given after this list.
- Input
Number of threads and size of matrix.
- Output
Time taken for the computation.
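A sketch that times both variants with omp_get_wtime(); the matrix size is illustrative, and the downloadable omp-loop-invert.c may differ in detail.

#include <stdio.h>
#include <omp.h>
#define M 1000
#define N 1000

static double a[M][N];

static void init(void)
{
    int i, j;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 1.0;
}

int main(void)
{
    int i, j;
    double t0;

    /* Variant 1: parallel for on the inner loop -> m-1 fork/join steps. */
    init();
    t0 = omp_get_wtime();
    for (i = 1; i < M; i++)
    {
        #pragma omp parallel for
        for (j = 0; j < N; j++)
            a[i][j] = 2 * a[i-1][j];
    }
    printf("inner loop parallel     : %f s\n", omp_get_wtime() - t0);

    /* Variant 2: inverted loops -> a single fork/join step. */
    init();
    t0 = omp_get_wtime();
    #pragma omp parallel for private(i)
    for (j = 0; j < N; j++)
        for (i = 1; i < M; i++)
            a[i][j] = 2 * a[i-1][j];
    printf("inverted, outer parallel: %f s\n", omp_get_wtime() - t0);

    printf("a[%d][%d] = %e\n", M-1, N-1, a[M-1][N-1]);
    return 0;
}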
|