Page 238 - Invited Paper Session (IPS) - Volume 1

IPS 151 Rubén C.
    We describe below the implementation of the parallel algorithm using the pthreads library. In this model, all the data, whether stored in global variables or in local variables of the functions that each thread executes, reside in the memory of the computer. This produces a bottleneck, because the number of registers in the interface that connects the processors with the memory of the computer is limited. To reduce the impact of this bottleneck, a distributed version has been implemented using the mpich library [3]; it runs on a cluster of computers, each with several cores, and in this case each process executes in its own private memory.
    In the case of the GPU implementation, the first feature to consider is that the card has its own memory, and the code that its processors execute can only access this memory, so data transfer operations are needed between the computer's memory and the memory of the card. The second feature is that the code executed on the card has different characteristics from the code executed on the computer. Since the functions executed on the card are not capable of generating pseudo-random numbers by themselves, it is necessary to build a sequence of n × b of these numbers. A first alternative is to build the sequence on the computer and then copy it to the memory of the card, but fortunately there is a library called curand that generates pseudo-random numbers directly on the card.
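The generation step can be sketched with curand's host API, which fills card memory directly and avoids the generate-then-copy alternative; the wrapper name `GenPseudoInCard`, the seed, and the error handling here are illustrative assumptions:

```cuda
/* Sketch: fill card memory with n*b uniform pseudo-random floats via the
   curand host API. Seed and error handling are illustrative. */
#include <cuda_runtime.h>
#include <curand.h>

int GenPseudoInCard(float **d_prand, unsigned int n, unsigned int b) {
    curandGenerator_t gen;
    if (cudaMalloc((void **)d_prand, (size_t)n * b * sizeof(float)) != cudaSuccess)
        return -1;                                       /* card memory exhausted */
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);    /* illustrative seed */
    curandGenerateUniform(gen, *d_prand, (size_t)n * b); /* values in (0, 1] */
    curandDestroyGenerator(gen);
    return 0;
}
```

Note that `curandGenerateUniform` includes 1.0 in its range, so an index computed as `prand * n` must be clamped to n − 1 before it is used to address the observations.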

        Input:   X, Y  set of n observations
                 b     number of bootstrap iterations

        Output:  P     Pearson coefficients

        1  GenPseudoInCard(prand, n × b)
        2  CopyToCard(X, Y, n)
        3  PearsonInCard <<< b >>> (X, Y, n, prand, P)
        4  CopyToHost(P, b)
        5  return P

            Figure 1: GPU Parallel Algorithm

    Figure 1 shows the algorithm of the parallel version that uses the graphics card. It must be taken into account that the memory space required on the card is (2n + b + n × b) × 4 bytes, which can be a limitation.

__global__ void ComputeRhoInDev(float *v1, float *v2, float *prand,
                                float *rhodev, unsigned int n) {
    unsigned int i, j, index = blockIdx.x * n;  /* this block's slice of prand */
    float sx = 0, sy = 0, sx2 = 0, sy2 = 0, sxy = 0, ro;
    /* Body reconstructed from the truncated listing: one replicate per block */
    for (i = 0; i < n; i++) {
        j = (unsigned int)(prand[index + i] * n);  /* resample with replacement */
        if (j >= n) j = n - 1;                     /* uniform range includes 1.0 */
        sx += v1[j]; sy += v2[j]; sxy += v1[j] * v2[j];
        sx2 += v1[j] * v1[j]; sy2 += v2[j] * v2[j];
    }
    ro = (n * sxy - sx * sy) / (sqrtf(n * sx2 - sx * sx) * sqrtf(n * sy2 - sy * sy));
    rhodev[blockIdx.x] = ro;
}


227 | ISI WSC 2019