Page 27 - Special Topic Session (STS) - Volume 1

STS346 A.H.M. Rahmatullah Imon
            focuses  on  both  x  and y  dimensions at  the  same  time,  but  also considers
            classification techniques to identify outliers.

            2.  Methodology
            Consider a standard regression model

                            $Y = X\beta + \epsilon$                                        (1)

            where $Y$ is an $n$-vector of observed responses, $X$ is an $n \times p$ matrix
            representing $p$ explanatory variables including the constant, $\beta$ is a $p$-vector of
            unknown finite parameters and $\epsilon$ is an $n$-vector of random disturbances with
            $E(\epsilon) = 0$ and $V(\epsilon) = \sigma^2 I$. The traditionally used Ordinary Least Squares (OLS)
            estimator of $\beta$ is $\hat{\beta} = (X^T X)^{-1} X^T Y$ and the vector of fitted values is
            $\hat{Y} = X\hat{\beta} = HY$. The matrix

                            $H = X (X^T X)^{-1} X^T$                                       (2)

            is often referred to as the weight or leverage matrix, whose diagonal elements $h_{ii}$
            are termed leverages. The OLS residual vector $\hat{\epsilon}$ is defined as $\hat{\epsilon} = Y - \hat{Y}$.
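            As a minimal sketch of the quantities defined so far, the following NumPy
            snippet computes the OLS estimate, the leverage matrix, the fitted values,
            the residuals and the leverages. The function name `ols_leverage` and the
            toy data are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def ols_leverage(X, y):
    """Return OLS coefficients, fitted values, residuals and leverages."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    XtX = X.T @ X
    # OLS estimate: solve (X^T X) beta = X^T y instead of forming the inverse
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    # Leverage matrix X (X^T X)^{-1} X^T
    H = X @ np.linalg.solve(XtX, X.T)
    fitted = H @ y                 # same as X @ beta_hat
    residuals = y - fitted         # OLS residual vector
    leverages = np.diag(H)         # diagonal elements h_ii
    return beta_hat, fitted, residuals, leverages

# Hypothetical toy data: the last x value lies far from the others
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 10.5])
X = np.column_stack([np.ones_like(x), x])   # constant column plus one regressor

beta_hat, fitted, residuals, leverages = ols_leverage(X, y)
```

            In this toy design the leverages sum to the number of columns of the design
            matrix, and the isolated point at x = 10 carries the largest leverage.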
            Observations corresponding to exceptionally large $\hat{\epsilon}_i$ values are termed
            outliers. However, the question still remains: how large is large? For this reason
            we often consider a standardized version of the residuals. One very popular
            choice is the deleted Studentized residuals (DSR), defined as

                            $t_i = \dfrac{y_i - x_i^T \hat{\beta}^{(-i)}}{\hat{\sigma}^{(-i)} \sqrt{1 - h_{ii}}}, \quad i = 1, 2, \ldots, n$      (3)

            where $\hat{\beta}^{(-i)}$ and $\hat{\sigma}^{(-i)}$ are the OLS estimates of $\beta$ and $\sigma$ respectively with the
            $i$-th observation deleted. We call an observation an outlier when its
            corresponding deleted Studentized residual exceeds 3 in absolute value.
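            The cut-off rule can be sketched in NumPy as follows. This is a hedged
            illustration, not the author's code: it uses the common textbook closed form
            of the deleted (externally) Studentized residual, with the full-data residual
            in the numerator and the deleted variance estimate obtained without refitting,
            which may not match the printed expression term for term. The function name
            and the toy data are assumptions.

```python
import numpy as np

def deleted_studentized_residuals(X, y):
    """Deleted (externally) Studentized residuals via the closed form."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # leverage matrix
    h = np.diag(H)                          # leverages h_ii
    e = y - H @ y                           # full-data OLS residuals
    sse = e @ e
    # Residual variance with observation i deleted, computed without refitting:
    # SSE_(-i) = SSE - e_i^2 / (1 - h_ii), on n - p - 1 degrees of freedom
    s2_deleted = (sse - e**2 / (1.0 - h)) / (n - p - 1)
    return e / np.sqrt(s2_deleted * (1.0 - h))

# Hypothetical toy data: a straight line with small noise and one shifted response
x = np.arange(10, dtype=float)
noise = 0.1 * np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1])
y = 2.0 + 0.5 * x + noise
y[5] += 5.0                                 # planted outlier
X = np.column_stack([np.ones_like(x), x])

t = deleted_studentized_residuals(X, y)
flagged = np.flatnonzero(np.abs(t) > 3)     # indices declared outliers
```

            The closed form avoids n separate refits; a brute-force leave-one-out loop
            gives the same values and can serve as a check on small data sets.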
            Observations corresponding to exceptionally large $h_{ii}$ values are termed high
            leverage points. It is essential to examine outliers and high leverage points
            simultaneously rather than separately. Gray (1986) proposed the Leverage-Residual
            (L-R) plot, where the leverage value $h_{ii}$ for each observation $i$ is
            plotted against the square of a normalised form of its corresponding residual.
            The bulk of the cases will be associated with low leverage and small residuals
            so that they cluster near the origin (0,0). The unusual cases will have either
            high leverages or large residual components and so will tend to be separated
            from the bulk of the data. High leverage cases will be located in the upper area
            of the plot and observations with large residuals will be located in the area to
            the right.
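            A minimal sketch of the coordinates behind such a plot is given below. The
            text does not spell out the normalisation, so the snippet assumes each
            squared residual is divided by the residual sum of squares; the function
            name `lr_plot_coordinates` and the toy data are likewise assumptions.

```python
import numpy as np

def lr_plot_coordinates(X, y):
    """Return (squared normalised residuals, leverages) for an L-R style plot."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    H = X @ np.linalg.solve(X.T @ X, X.T)   # leverage matrix
    leverage = np.diag(H)                   # vertical axis: h_ii
    e = y - H @ y                           # OLS residuals
    # Assumed normalisation: e_i^2 / sum_j e_j^2, so the values sum to one
    sq_norm_resid = e**2 / np.sum(e**2)     # horizontal axis
    return sq_norm_resid, leverage

# Hypothetical toy data: one shifted response and one remote x value
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 20.0])
y = 1.0 + 2.0 * x
y[3] += 6.0                                 # large residual, ordinary leverage
X = np.column_stack([np.ones_like(x), x])

r2, h = lr_plot_coordinates(X, y)
upper_case = int(np.argmax(h))              # case highest on the plot
right_case = int(np.argmax(r2))             # case farthest to the right
```

            Passing `r2` and `h` to any scatter-plot routine places the remote x value
            high on the vertical axis and the shifted response far to the right, while
            the remaining cases stay near the origin.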



                                                                16 | ISI WSC 2019