Page 238 - Invited Paper Session (IPS) - Volume 1

IPS 151 Rubén C.
    We describe below the implementation of the parallel algorithm using the pthreads library. In this model, all the data, whether stored in global variables or in local variables of the functions that each thread executes, reside in the memory of the computer. This produces a bottleneck, because the number of registers in the interface that connects the processors with the memory of the computer is limited. To reduce the impact of this bottleneck, a distributed version has been implemented using the mpich library [3]; it runs on a cluster of computers, each with several cores, and in this case each process executes in its own private memory.
    In the case of the GPU implementation, the first feature to consider is that the card has its own memory, and the code that its processors execute can only access this memory, so data transfer operations are needed between the computer's memory and the memory of the card. The second feature is that the code executed on the card has different characteristics from the code executed on the computer. Since the functions executed on the card are not capable of generating pseudo-random numbers by themselves, it is necessary to build a sequence of n × b of these numbers. A first alternative is to build the sequence on the computer and then copy it to the memory of the card, but fortunately there is a library called curand that generates pseudo-random numbers directly on the card.
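The generation step can be sketched with curand's host API, which fills card memory directly and avoids the generate-then-copy alternative; the wrapper name `GenPseudoInCard`, the seed, and the error handling here are illustrative assumptions:

```cuda
/* Sketch: fill card memory with n*b uniform pseudo-random floats via the
   curand host API. Seed and error handling are illustrative. */
#include <cuda_runtime.h>
#include <curand.h>

int GenPseudoInCard(float **d_prand, unsigned int n, unsigned int b) {
    curandGenerator_t gen;
    if (cudaMalloc((void **)d_prand, (size_t)n * b * sizeof(float)) != cudaSuccess)
        return -1;                                       /* card memory exhausted */
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);    /* illustrative seed */
    curandGenerateUniform(gen, *d_prand, (size_t)n * b); /* values in (0, 1] */
    curandDestroyGenerator(gen);
    return 0;
}
```

Note that `curandGenerateUniform` includes 1.0 in its range, so an index computed as `prand * n` must be clamped to n − 1 before it is used to address the observations.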

        Input:   X, Y  set of n observations
                 b     number of bootstrap iterations

        Output:  P     Pearson coefficients

        1  GenPseudoInCard(prand, n × b)
        2  CopyToCard(X, Y, n)
        3  PearsonInCard <<< b >>> (X, Y, n, prand, P)
        4  CopyToHost(P, b)
        5  return P

            Figure 1: GPU Parallel Algorithm

    Figure 1 shows the algorithm of the parallel version that uses the graphics card. It must be taken into account that the memory space required on the card is (2n + b + n × b) × 4 bytes, which can be a limitation.

__global__ void ComputeRhoInDev(float *v1, float *v2, float *prand,
                                float *rhodev, unsigned int n) {
    unsigned int i, j, index = blockIdx.x * n;  /* this block's slice of prand */
    float sx = 0, sy = 0, sx2 = 0, sy2 = 0, sxy = 0, ro;
    /* Body reconstructed from the truncated listing: one replicate per block */
    for (i = 0; i < n; i++) {
        j = (unsigned int)(prand[index + i] * n);  /* resample with replacement */
        if (j >= n) j = n - 1;                     /* uniform range includes 1.0 */
        sx += v1[j]; sy += v2[j]; sxy += v1[j] * v2[j];
        sx2 += v1[j] * v1[j]; sy2 += v2[j] * v2[j];
    }
    ro = (n * sxy - sx * sy) / (sqrtf(n * sx2 - sx * sx) * sqrtf(n * sy2 - sy * sy));
    rhodev[blockIdx.x] = ro;
}


227 | ISI WSC 2019