Objective
To write a Pthreads-CUDA program to compute matrix-vector multiplication
using block-striped partitioning of the matrix for uniform data distribution.
Description
This is an implementation of matrix-vector multiplication
using the block-striped partitioning algorithm. Each thread
multiplies its block of matrix rows with the vector by calling
the CUDA kernel MatrixVectorMultiplication and writes the result
into the result vector. A mutex is used on the result vector to
guarantee atomicity. Each thread accesses its elements based on
its id, which the main thread assigns in the order of thread creation.
Since the number of threads and the number of elements are known,
the elements each thread must access are easily computed.
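The writeup names the kernel MatrixVectorMultiplication but does not show it.
Below is a minimal sketch, assuming one GPU thread computes one row of the
calling thread's block; the parameter names are illustrative, not taken from
the source.

__global__ void MatrixVectorMultiplication(float *MyMatrixA, float *VectorB,
                                           float *MyResultVector,
                                           int NumRows, int NumCols)
{
    int Row = blockIdx.x * blockDim.x + threadIdx.x;  /* one GPU thread per output row */
    if (Row < NumRows) {
        float Sum = 0.0f;
        for (int j = 0; j < NumCols; j++)
            Sum += MyMatrixA[Row * NumCols + j] * VectorB[j];
        MyResultVector[Row] = Sum;
    }
}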
Implementation of Matrix Vector Multiplication :
Step 1 : Three arrays MatrixA, VectorB and ResultVector are declared on the
host-CPU. Two additional arrays are used on the host-CPU for computation:
one stores the part of the matrix accessible to each thread, and the other
stores the part of the result vector computed by that thread.
Step 2 : The main thread initializes the arrays MatrixA, VectorB and
ResultVector: MatrixA and VectorB are filled with single-precision real
values, and ResultVector is initialized.
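A sketch of Steps 1 and 2 on the host-CPU; the fill values are arbitrary
single-precision choices, not dictated by the writeup.

#include <stdlib.h>

float *MatrixA, *VectorB, *ResultVector;  /* Step 1: arrays shared by all threads */

void InitializeArrays(int m, int n)
{
    MatrixA      = (float *)malloc(m * n * sizeof(float));
    VectorB      = (float *)malloc(n * sizeof(float));
    ResultVector = (float *)malloc(m * sizeof(float));
    for (int i = 0; i < m * n; i++)
        MatrixA[i] = (float)(i % 10) + 1.0f;  /* arbitrary single-precision values */
    for (int j = 0; j < n; j++)
        VectorB[j] = 1.0f;
    for (int i = 0; i < m; i++)
        ResultVector[i] = 0.0f;               /* Step 2: result initialized to zero */
}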
Step 3 : ThreadPart, which decides how many elements of the matrix each thread is to access, is computed (Steps 3 and 4 are sketched together below).
Step 4 : The threads are created.
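A sketch of Steps 3 and 4, assuming the number of matrix rows m divides
evenly by the number of threads p (the uniform-distribution case);
ThreadWork is a hypothetical name for the per-thread worker routine.

#include <pthread.h>

int ThreadPart;              /* Step 3: number of matrix rows each thread handles */

void *ThreadWork(void *arg); /* per-thread worker, sketched in the steps below */

void CreateThreads(pthread_t *Threads, long p, int m)
{
    ThreadPart = m / p;      /* assumes m % p == 0 for uniform distribution */
    for (long Id = 0; Id < p; Id++)
        pthread_create(&Threads[Id], NULL, ThreadWork, (void *)Id);  /* Step 4: ids in creation order */
}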
Step 5 : Each thread allocates memory for MyMatrixA and MyResultVector on the host-CPU (Steps 5 and 6 are sketched together below).
Step 6 : MyMatrixA is constructed from MatrixA on the host-CPU, depending on the value of the thread id.
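Inside each thread's worker, Steps 5 and 6 could be realised as follows;
MyThreadId is the id passed at thread creation, and MatrixA and ThreadPart
come from the sketches above.

#include <stdlib.h>
#include <string.h>

extern float *MatrixA;  /* global matrix from Step 1 */
extern int ThreadPart;  /* rows per thread from Step 3 */

void BuildThreadBlock(long MyThreadId, int n,
                      float **MyMatrixA, float **MyResultVector)
{
    /* Step 5: per-thread host buffers */
    *MyMatrixA      = (float *)malloc((size_t)ThreadPart * n * sizeof(float));
    *MyResultVector = (float *)malloc(ThreadPart * sizeof(float));
    /* Step 6: copy this thread's block of rows out of the global matrix */
    memcpy(*MyMatrixA, MatrixA + (size_t)MyThreadId * ThreadPart * n,
           (size_t)ThreadPart * n * sizeof(float));
}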
Step 7 : Memory for the vectors is allocated on the device-GPU. The values of the vectors
on the host machine are copied to the vectors on the device machine.
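Step 7 as a sketch, using only the CUDA calls listed under "CUDA API used" below.

#include <cuda_runtime.h>

extern float *VectorB;  /* global vector from Step 1 */
extern int ThreadPart;  /* rows per thread from Step 3 */

void CopyInputsToDevice(float *MyMatrixA, int n,
                        float **DevMatrixA, float **DevVectorB, float **DevResult)
{
    cudaMalloc((void **)DevMatrixA, ThreadPart * n * sizeof(float));
    cudaMalloc((void **)DevVectorB, n * sizeof(float));
    cudaMalloc((void **)DevResult,  ThreadPart * sizeof(float));
    cudaMemcpy(*DevMatrixA, MyMatrixA, ThreadPart * n * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(*DevVectorB, VectorB, n * sizeof(float),
               cudaMemcpyHostToDevice);
}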
Step 8 : The availability of a device is checked from the host-CPU.
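The availability check can be done with cudaGetDeviceCount; a minimal version:

#include <stdio.h>
#include <cuda_runtime.h>

int DeviceAvailable(void)
{
    int DeviceCount = 0;
    if (cudaGetDeviceCount(&DeviceCount) != cudaSuccess || DeviceCount == 0) {
        fprintf(stderr, "No CUDA-capable device found\n");
        return 0;
    }
    return 1;
}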
Step 9 : Each thread computes the matrix-vector multiplication by calling the CUDA kernel
MatrixVectorMultiplication and copies the result back to MyResultVector on the host-CPU.
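Step 9 might be realised as below; the launch geometry (256 threads per
block) is an illustrative choice, not fixed by the writeup.

#include <cuda_runtime.h>

extern int ThreadPart;
__global__ void MatrixVectorMultiplication(float *, float *, float *, int, int);

void LaunchKernel(float *DevMatrixA, float *DevVectorB, float *DevResult,
                  float *MyResultVector, int n)
{
    int ThreadsPerBlock = 256;  /* illustrative */
    int Blocks = (ThreadPart + ThreadsPerBlock - 1) / ThreadsPerBlock;
    MatrixVectorMultiplication<<<Blocks, ThreadsPerBlock>>>(DevMatrixA, DevVectorB,
                                                            DevResult, ThreadPart, n);
    /* cudaMemcpy waits for the kernel to finish before copying back */
    cudaMemcpy(MyResultVector, DevResult, ThreadPart * sizeof(float),
               cudaMemcpyDeviceToHost);
}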
Step 10 : Each thread performs a mutex operation on the result vector and assigns its
corresponding elements on the host-CPU.
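A sketch of the mutex-protected write-back in Step 10:

#include <pthread.h>

extern float *ResultVector;  /* shared result from Step 1 */
extern int ThreadPart;

pthread_mutex_t ResultMutex = PTHREAD_MUTEX_INITIALIZER;

void WriteBackResult(long MyThreadId, float *MyResultVector)
{
    pthread_mutex_lock(&ResultMutex);   /* serialize writes to the shared result */
    for (int i = 0; i < ThreadPart; i++)
        ResultVector[MyThreadId * ThreadPart + i] = MyResultVector[i];
    pthread_mutex_unlock(&ResultMutex);
}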
Step 11 : The main thread waits for all threads to exit and
then prints the resultant vector.
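Step 11 in the main thread might look like:

#include <stdio.h>
#include <pthread.h>

extern float *ResultVector;

void JoinAndPrint(pthread_t *Threads, long p, int m)
{
    for (long Id = 0; Id < p; Id++)
        pthread_join(Threads[Id], NULL);  /* wait for all workers to exit */
    for (int i = 0; i < m; i++)
        printf("ResultVector[%d] = %f\n", i, ResultVector[i]);
}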
CUDA API used :
To Allocate memory on device-GPU :
cudaMalloc(void** array, size_t size)
To Free memory allocated on device-GPU:
cudaFree(void* array)
To transfer from host-CPU to device-GPU:
cudaMemcpy((void*)device_array, (void*)host_array, size, cudaMemcpyHostToDevice)
To transfer from device-GPU to host-CPU:
cudaMemcpy((void*)host_array, (void*)device_array, size, cudaMemcpyDeviceToHost)
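A minimal round trip through the four calls above (error checking omitted
for brevity):

#include <cuda_runtime.h>

void RoundTrip(float *host_array, int size)
{
    float *device_array;
    cudaMalloc((void **)&device_array, size);
    cudaMemcpy(device_array, host_array, size, cudaMemcpyHostToDevice);
    /* ... kernel launches would operate on device_array here ... */
    cudaMemcpy(host_array, device_array, size, cudaMemcpyDeviceToHost);
    cudaFree(device_array);
}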
Input
The input to the problem is given as arguments in the command line.
It should be given in the following format.
Suppose the dimension of the matrix is m x n, the size of the vector is n,
and the number of threads is p.
./Mat_Vect_Mult_Pthreads_CUDA m n n p
The main thread generates the matrix and the vector.
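Parsing the command line in this format might look like the sketch below;
the check that the matrix column count equals the vector size is an added
safeguard, not stated in the writeup.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 5) {
        fprintf(stderr, "Usage: %s <rows m> <cols n> <vector size n> <threads p>\n",
                argv[0]);
        return 1;
    }
    int m = atoi(argv[1]);
    int n = atoi(argv[2]);
    int VectorSize = atoi(argv[3]);
    int p = atoi(argv[4]);
    if (n != VectorSize) {
        fprintf(stderr, "Matrix columns (%d) must equal vector size (%d)\n",
                n, VectorSize);
        return 1;
    }
    /* ... generate MatrixA and VectorB, then run the steps above with m, n, p ... */
    return 0;
}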
Output
The main thread prints the resultant vector.