Page 40 - Special Topic Session (STS) - Volume 1

STS346 Abu Sayed M. et al.
situations. In case 1, five low leverage cases (10%) are replaced by high leverage points; in cases 2 and 3 we replace 20% and 30% of the low leverage points with high leverage points, respectively. We then compute the leverage values for this data set. As before, the cut-off point for rules 1 and 3 is 0.08 and for rules 2 and 4 it is 0.12. The cut-off points for rule 5 are 0.1052, 0.1024, and 0.0845 for 10%, 20%, and 30% high leverage points, respectively. We observe that for 10% contamination the traditional leverage values ŵ_ii identify the high leverage points successfully, but their performance deteriorates as the level of contamination increases: for 20% contamination they fail to identify 4 of the 10 high leverage cases, and for 30% contamination they fail to identify 10 of the 15 high leverage points. The newly proposed leverage measures w̃_ii perform very well in this regard; all high leverage points are identified successfully irrespective of the level of contamination.
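As an illustration of the kind of cut-off rules compared above, the following sketch computes the classical hat-matrix leverages for a simple regression and applies 2M- and 3M-style cut-offs (flag point i when h_ii exceeds two or three times the mean leverage). This is only a minimal sketch with illustrative data: the paper's rules 1-5 and the robust measures w̃_ii are defined in section 4 and are not reproduced here.

```python
import numpy as np

def hat_leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X' for design matrix X."""
    Q, _ = np.linalg.qr(X)           # QR gives a stable route to diag(H)
    return np.sum(Q**2, axis=1)

def flag_high_leverage(h, multiplier=2.0):
    """M-rule style cut-off: flag point i when h_ii > multiplier * mean(h)."""
    return h > multiplier * h.mean()

rng = np.random.default_rng(0)
x = rng.uniform(20, 40, size=20)     # a clean cloud of x values
x[:2] = [100, 105]                   # plant two high leverage points
X = np.column_stack([np.ones_like(x), x])
h = hat_leverages(X)
print(np.where(flag_high_leverage(h, 2.0))[0])   # 2M rule flags
print(np.where(flag_high_leverage(h, 3.0))[0])   # 3M rule flags
```

With the planted points far from the cloud, both cut-offs flag them here; the difference between the rules only becomes visible as contamination grows, which is what the comparison above examines.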
In this section we report a Monte Carlo simulation designed to investigate the performance of the different leverage measures in the linear functional relationship model. For four sample sizes, n = 20, 30, 50 and 100, we generate the X values from Uniform(20, 40) and consider three percentages of high leverage points: 10%, 20%, and 30%. The X value corresponding to the lowest high leverage point is set to 100, and each subsequent value is incremented by 5. To generate a model like (2), we then define x_i = X_i + δ_i, where δ_i is N(0, 1). The values of y_i are generated as y_i = 20 + 2X_i + ε_i, where ε_i is also N(0, 1). For each sample we apply all five leverage identification rules mentioned in section 4 and compute the correct identification rate (IR) and the swamping rate (SR) in percentages. We run 10,000 simulations for each combination.

When no high
leverage point exists, we observe from the above table that for n = 20 all methods considered in the simulation perform well. However, rule 1, the traditional leverage measure based on the 2M rule, has a swamping rate of about 5%. The newly proposed rule 4 performs best, with the lowest swamping rate, followed by rules 2, 5, and 3. The performance of all these rules tends to improve as the sample size increases, but rule 1 still has a relatively high swamping rate, which clearly shows that the 2M rule is too prone to declaring low leverage points as high leverage points. With 10% high leverage points, almost all methods perform very well; each maintains a 100% identification rate with a low swamping rate, except that for n = 100 the identification rate of rule 2 is 90%. But when 20% or 30% high leverage points are present in the data, both the 2M and the 3M rules break down. Rule 2, i.e., the 3M rule, performs worst, as its correct identification rate is often 0%. The performance of rule 1 is also poor, as its correct identification rate is only around 13%.
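The simulation design described above (X from Uniform(20, 40), planted high leverage values 100, 105, ..., and N(0, 1) errors in both x and y) can be sketched as follows. As a rough check the sketch also scores the identification rate of the classical 2M rule under 10% contamination; the paper's five rules and the robust measures are defined in section 4 and are not reproduced here, and this fragment uses 1,000 replicates rather than the paper's 10,000.

```python
import numpy as np

def make_sample(n, frac_high, rng):
    """One replicate of the design: X ~ U(20, 40), with the first
    frac_high * n values replaced by 100, 105, 110, ... to create high
    leverage points; x and y then receive N(0, 1) errors as in model (2)."""
    X = rng.uniform(20, 40, size=n)
    m = int(frac_high * n)
    X[:m] = 100 + 5 * np.arange(m)           # planted high leverage X values
    x = X + rng.normal(size=n)               # x_i = X_i + delta_i
    y = 20 + 2 * X + rng.normal(size=n)      # y_i = 20 + 2 X_i + eps_i
    return x, y, np.arange(m)                # true high leverage indices

def twoM_flags(x):
    """Leverage of x in simple regression, with the 2M cut-off."""
    d = x - x.mean()
    h = 1 / len(x) + d**2 / (d**2).sum()
    return h > 2 * h.mean()

rng = np.random.default_rng(1)
hits = trials = 0
for _ in range(1000):
    x, _, true_idx = make_sample(20, 0.10, rng)
    flags = twoM_flags(x)
    hits += flags[true_idx].sum()
    trials += len(true_idx)
print(hits / trials)                         # empirical identification rate
```

Under 10% contamination the planted points sit far from the Uniform(20, 40) cloud, so even the 2M rule identifies them essentially every time, consistent with the table's near-100% identification rates at this contamination level.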


29 | ISI WSC 2019