Page 130 - Contributed Paper Session (CPS) - Volume 4

CPS2156 Luis Sanguiao Sande



Bias removal through sampling in machine learning models
Luis Sanguiao Sande
Spanish NSI (INE)

Abstract
It is well known that machine learning models have some bias, a consequence of the bias-variance tradeoff and the optimization of the mean squared error. For aggregates of the output variable(s), a probabilistic sample can be used to correct the bias, but this second sample is needed in addition to the training sample. We propose an estimator that uses a single probabilistic sample both for modelling and for bias removal. Two examples show that the bias is indeed removed. In one of the examples the variance increases notably (the increase is almost exactly compensated by the bias removal), but in the other it unexpectedly decreases. This suggests that this kind of method might be useful in combination with machine learning algorithms when they are used to estimate aggregates of predicted variables.

Keywords
Machine learning; bias correction; sampling; random forest; bias-variance tradeoff

                  1.  Introduction
   Suppose we have a finite population $U = \{1, 2, \dots, N\}$, a set of features $x_k$ for each unit $k \in U$, and a variable $y$ that we want to model. Let $s = \{1, \dots, n\}$ be a sample drawn with a known sampling design $p$, whose inclusion probabilities $\pi_k$ are supposed to be known. A machine learning algorithm $M$ maps any sample to a function, giving a prediction $\hat{y}_k = M(s)(x_k)$ for each $k \in U$. If the predictors are known, we can estimate the total of $y$ as

$$Y = \sum_{k=1}^{N} y_k \approx \sum_{k \in s} y_k + \sum_{k \notin s} \hat{y}_k.$$
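The plug-in estimator above (observed values on the sample, model predictions off the sample) can be sketched as follows. This is a minimal illustration with a simulated population and an ordinary least-squares fit standing in for the machine learning algorithm $M$; all names and the data-generating process are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population U: one feature x_k and y_k = 2*x_k + noise.
N = 10_000
x = rng.uniform(0, 10, size=N)
y = 2.0 * x + rng.normal(0, 1, size=N)

# Simple random sample s of size n (known design, pi_k = n/N).
n = 500
s = rng.choice(N, size=n, replace=False)
in_s = np.zeros(N, dtype=bool)
in_s[s] = True

# "Machine learning algorithm" M: here an illustrative least-squares fit on s.
A = np.column_stack([np.ones(n), x[s]])
coef, *_ = np.linalg.lstsq(A, y[s], rcond=None)
y_hat = coef[0] + coef[1] * x          # predictions for every unit in U

# Plug-in estimate of the total Y: observed y on s, predictions off s.
Y_hat = y[in_s].sum() + y_hat[~in_s].sum()
print(Y_hat, y.sum())
```

With a well-specified model the plug-in total tracks the true total closely; the bias discussed next appears when the model systematically over- or under-predicts.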
   This may be a very good prediction, but it is biased because of the model. If we are reasonably sure that $y$ will not change, we can sample the population again and obtain an unbiased estimate of the bias, and thus an unbiased estimate of $Y$. But $y$ might have changed, or the cost of a second sampling might be too high. Another option, closer to our approach, would be to use the GREG estimator [6], but it is only asymptotically unbiased.
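The two-sample idea just described can be sketched as follows: draw an independent probabilistic sample, observe $y$ on it, and use a design-based (Horvitz-Thompson type) estimate of the total prediction error to correct the plug-in total. The population and the deliberately biased predictor below are illustrative assumptions, not the paper's examples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population with a deliberately biased predictor.
N = 10_000
y = rng.gamma(2.0, 5.0, size=N)
y_hat = 0.9 * y                      # stands in for biased model predictions

plug_in = y_hat.sum()                # biased estimate of the total Y

# Second simple random sample s2 (pi_k = m/N): estimate the bias of y_hat.
m = 400
s2 = rng.choice(N, size=m, replace=False)
bias_hat = (N / m) * (y_hat[s2] - y[s2]).sum()   # HT estimate of sum(y_hat - y)

corrected = plug_in - bias_hat       # design-unbiased estimate of Y
print(corrected, y.sum())
```

The correction removes the systematic 10% shortfall at the price of the sampling variance of `bias_hat`, which is the tradeoff the paper's two examples quantify.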
   The method proposed for bias removal, inspired by cross-validation [4], divides the original sample into two subsets (equivalent to training and validation sets) and uses the first one for modelling and the second one for
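Based on the description so far, a minimal sketch of the single-sample split approach under simple random sampling: model on one half of $s$, then correct the plug-in total with a design-based estimate of the off-sample prediction error computed on the other half. The details of the paper's actual estimator may differ; the fit and data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population and a single probabilistic sample s (SRS, pi_k = n/N).
N = 10_000
x = rng.uniform(0, 10, size=N)
y = np.exp(0.3 * x) + rng.normal(0, 1, size=N)
n = 600
s = rng.choice(N, size=n, replace=False)

# Split s into a training part and a validation part, as in cross-validation.
s_train, s_val = s[: n // 2], s[n // 2 :]

# Model on the training part (an illustrative quadratic least-squares fit).
A = np.column_stack([np.ones(s_train.size), x[s_train], x[s_train] ** 2])
coef, *_ = np.linalg.lstsq(A, y[s_train], rcond=None)

def predict(xs):
    return coef[0] + coef[1] * xs + coef[2] * xs ** 2

# Plug-in total, then a bias correction estimated from the validation part.
plug_in = y[s].sum() + predict(np.delete(x, s)).sum()
resid = predict(x[s_val]) - y[s_val]
bias_hat = ((N - n) / s_val.size) * resid.sum()  # scaled to the N-n off-sample units
corrected = plug_in - bias_hat
print(corrected, y.sum())
```

Because the validation units were not used to fit the model, their residuals behave like out-of-sample errors, which is what makes this single-sample correction possible.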

119 | ISI WSC 2019