
implementation. Fortunately, for bootstrap aggregated [1] algorithms (like random forests) and simple sampling designs, we get an approximate version based on the out of bag predictions.
Theorem B: Let $\tilde{y}_k = \hat{f}(x_k)$ be the predictions when $k \notin s$ and the out of bag predictions when $k \in s$. Let $\tilde{e}_k$ be the out of bag errors. Under simple (possibly stratified) design, the estimator

$$\sum_{k \in U} \tilde{y}_k + \sum_{k \in s} \frac{\tilde{e}_k}{\pi_k}$$

is an approximation of an unbiased second stage based estimator for $Y$.
Proof. See [5].
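As a concrete illustration of this result, here is a minimal sketch of the out of bag corrected total under simple random sampling without replacement, assuming scikit-learn's RandomForestRegressor (its oob_prediction_ attribute supplies the out of bag predictions). The names X_U, s and y_s are hypothetical placeholders for the population auxiliary data, the sample indices and the sampled target values; the sketch is not taken from [5].

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def oob_corrected_total(X_U, s, y_s, n_trees=500, seed=0):
    """Sketch of an out of bag corrected total under SRSWOR.

    X_U : (N, p) array with the auxiliary variables of the whole population U
    s   : integer indices of the sampled units (sample size n)
    y_s : observed target values for the sampled units
    """
    X_U, s, y_s = np.asarray(X_U), np.asarray(s), np.asarray(y_s)
    N, n = X_U.shape[0], s.size

    # n_trees should be large enough that every sampled unit is out of bag
    # for at least some trees.
    rf = RandomForestRegressor(n_estimators=n_trees, oob_score=True,
                               random_state=seed)
    rf.fit(X_U[s], y_s)

    # tilde_y_k: ordinary predictions outside the sample,
    # out of bag predictions inside the sample.
    y_tilde = rf.predict(X_U)
    y_tilde[s] = rf.oob_prediction_

    # Out of bag errors, expanded by N/n (the SRSWOR weight 1/pi_k).
    e_tilde = y_s - rf.oob_prediction_

    return y_tilde.sum() + (N / n) * e_tilde.sum()
```

Under simple random sampling $\pi_k = n/N$, so the design based correction is just the expansion $N/n$ of the summed out of bag errors; a stratified design would apply the analogous expansion within each stratum.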
This way the estimator is expressed as the sum of a purely model based expression and a sampling based estimation of the bias. Note that this result also serves as a confirmation that the out of bag errors are a good indicator of the performance of a bootstrap aggregated algorithm. An unbiased estimator for the variance is also known.
Proposition C: Let $\hat{V}_1$ be a second stage unbiased estimator for the variance of $\hat{Y}_1$. An unbiased estimator $\hat{V}_2$ of the variance of $\hat{Y}_2$ can then be constructed from $\hat{V}_1$.
Proof. See [5].
Note that if we want to build the estimator $\hat{V}_2$ we have to be able to estimate the variance of $\hat{Y}_1$. Thus a measurable sampling design is required at the second stage of our two stage decomposition. In the examples only one unit is sampled at the second stage, so there is no way we can build the estimator. If we wanted to estimate the variance, we should take $n - 2$ elements at stage one and 2 elements at stage two. For random forests, out of bag predictions excluding two elements would be needed, but we are not aware of any software that provides such predictions.
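For illustration only, the sketch below spells out the explicit split just described: $n - 2$ sampled units train the forest (stage one) and the remaining 2 units estimate the bias correction and, because two units make the stage two design measurable, also the variance of that correction. The SRSWOR expansion and variance formulas, and the placeholder names X_U, s and y_s, are assumptions of this sketch rather than the expressions of [5].

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def split_two_stage_sketch(X_U, s, y_s, n_trees=500, seed=0):
    """Train on n-2 sampled units, use the other 2 to estimate the bias
    correction and the variance of that correction (illustrative SRSWOR
    formulas; not the out of bag approximation of Theorem B)."""
    X_U, s, y_s = np.asarray(X_U), np.asarray(s), np.asarray(y_s)
    rng = np.random.default_rng(seed)
    N, n = X_U.shape[0], s.size

    perm = rng.permutation(n)
    s1, y1 = s[perm[:-2]], y_s[perm[:-2]]   # stage one: training units
    s2, y2 = s[perm[-2:]], y_s[perm[-2:]]   # stage two: 2 held-out units

    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    rf.fit(X_U[s1], y1)

    e2 = y2 - rf.predict(X_U[s2])           # stage-two prediction errors
    M = N - (n - 2)                         # units not used for training
    pi = 2.0 / M                            # stage-two inclusion probability

    # Rough illustration: model based part plus expanded stage-two errors.
    total_hat = rf.predict(X_U).sum() + e2.sum() / pi

    # With two stage-two units the design is measurable, so the variance of
    # the expanded error sum can be estimated (standard SRSWOR formula).
    v_correction = M**2 * (1 - 2.0 / M) * np.var(e2, ddof=1) / 2
    return total_hat, v_correction
```

With a single stage-two unit, np.var of the errors is undefined, which is exactly why the variance estimator cannot be built in the examples.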

3. Results
We compare the estimator from Theorem B with the pure model based estimator on two very different populations. The first one is based on synthetic data constructed to hold a (noisy) equality, and the second one is based on real data (and therefore holds no such equality). In both cases a large number of samples (10,000) is taken, and the estimates are compared to the real (known) aggregated values of the target variables. The bias is estimated as the mean of the 10,000 estimates minus the real total of the target variable. The variance is estimated as the variance of the 10,000 estimates, and the mean square error is estimated as the mean of the squared difference between each estimate and the real total.
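The Monte Carlo summaries described above are straightforward to reproduce. In the minimal sketch below, draw_sample and estimate_total are hypothetical callables standing in for the sampling design and for whichever of the two estimators is being evaluated.

```python
import numpy as np

def monte_carlo_summary(draw_sample, estimate_total, true_total,
                        R=10_000, seed=0):
    """Estimated bias, variance and mean square error of an estimator over
    R repeated samples, as described in the text.  draw_sample and
    estimate_total are placeholder callables for the design and the
    estimator under evaluation."""
    rng = np.random.default_rng(seed)
    est = np.array([estimate_total(draw_sample(rng)) for _ in range(R)])

    bias = est.mean() - true_total          # mean of the estimates minus the real total
    variance = est.var(ddof=1)              # variance of the estimates
    mse = np.mean((est - true_total) ** 2)  # mean squared deviation from the real total
    return bias, variance, mse
```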


