Page 130 - Contributed Paper Session (CPS) - Volume 4
CPS2156 Luis Sanguiao Sande
Bias removal through sampling in machine
learning models
Luis Sanguiao Sande
Spanish NSI (INE)
Abstract
It is well known that machine learning models have some bias, a consequence
of the bias-variance tradeoff and the minimization of the mean squared error.
For aggregates of the output variable(s), a probabilistic sample can be used to
correct the bias, but this second sample is needed in addition to the training
sample. We propose an estimator that uses a single probabilistic sample both
for modelling and bias removal. Two examples show that the bias is indeed
removed. In one of the examples the variance increases notably (the increase
is almost exactly compensated by the bias removal), but in the other it
unexpectedly decreases. This suggests that this kind of method may be worth
combining with machine learning algorithms when they are used to estimate
aggregates of predicted variables.
Keywords
Machine learning; bias correction; sampling; random forest; bias-variance
tradeoff
1. Introduction
Suppose we have a finite population $U = \{1, 2, \dots, N\}$, a set of features
$x_k$ for each unit $k \in U$, and a variable $y$ that we want to model. Let
$s = \{1, \dots, n\}$ be a sample drawn with known sampling design $p$, and
suppose $x_k$ is known for every unit of the population. A machine learning
algorithm $M$ maps any sample $s$ to a function, giving predictions
$\hat{y}_k = M(s)(x_k)$ for each $k \in U$. If the predictors are known, we can
estimate the total of $y$ as
$$Y = \sum_{k=1}^{N} y_k \cong \sum_{k \in s} y_k + \sum_{k \notin s} \hat{y}_k$$
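The plug-in total above (observed values on the sample, model predictions off the sample) can be sketched as follows. This is a minimal toy illustration, not the paper's setup: the population, the single feature, the simple-random-sampling design, and the least-squares line standing in for the machine learning algorithm are all hypothetical choices for the sketch.

```python
import random

# Hypothetical toy population: one feature x, target y roughly linear in x.
random.seed(0)
N = 1000
x = [random.uniform(0, 10) for _ in range(N)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

# Simple random sample s of size n (a stand-in for a general design p).
n = 100
s = set(random.sample(range(N), n))

# Stand-in for M(s): a least-squares line fitted on the sample only.
xs = [x[k] for k in s]
ys = [y[k] for k in s]
xbar = sum(xs) / n
ybar = sum(ys) / n
slope = (sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
         / sum((a - xbar) ** 2 for a in xs))
intercept = ybar - slope * xbar
yhat = [intercept + slope * xi for xi in x]  # predictions for every k in U

# Plug-in total: observed y on the sample, predictions outside it.
Y_hat = sum(y[k] for k in s) + sum(yhat[k] for k in range(N) if k not in s)
Y_true = sum(y)
```

With a well-specified model the plug-in total tracks the true total closely; the bias discussed next appears when the model is misspecified or regularized.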
It might be a very good prediction, but it is biased because of the model.
If we are reasonably sure that $y$ will not change, we can sample the
population once again and obtain an unbiased estimate of the bias, and thus
an unbiased estimate of $Y$. But $y$ might have changed, or a second
sampling might be too expensive. Another option, closer to our approach,
would be to use the GREG estimator [6], but it is unbiased only asymptotically.
The method proposed for bias removal, inspired by cross-validation [4],
divides the original sample into two subsets (equivalent to training and
validation sets) and uses the first one for modelling and the second one for
119 | I S I W S C 2 0 1 9
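The split idea can be sketched as follows. This is a hedged toy illustration of the general mechanism (model on one subsample, design-based bias correction from the held-out subsample), not the paper's estimator: the data, the equal split, the simple-random-sampling weights, and the shrunken line fit (shrinkage mimics the regularization bias of real learners) are all assumptions of the sketch.

```python
import random

# Hypothetical population with one feature x and target y.
random.seed(1)
N = 1000
x = [random.uniform(0, 10) for _ in range(N)]
y = [xi ** 2 + random.gauss(0, 2) for xi in x]
Y_true = sum(y)

# One probabilistic sample, split into s1 (modelling) and s2 (bias removal).
s = random.sample(range(N), 200)
s1, s2 = s[:100], s[100:]

# Toy learner fitted on s1 only: a least-squares line, deliberately shrunk
# by 0.8 to mimic the bias a regularized model would have.
xs = [x[k] for k in s1]
ys = [y[k] for k in s1]
xb = sum(xs) / len(xs)
yb = sum(ys) / len(ys)
b1 = (sum((a - xb) * (c - yb) for a, c in zip(xs, ys))
      / sum((a - xb) ** 2 for a in xs))
b0 = yb - b1 * xb
yhat = [0.8 * (b0 + b1 * xi) for xi in x]

# Naive model-based total, then a design-based correction: s2 estimates the
# total prediction error (SRS assumed, so the weight is N / len(s2)).
Y_model = sum(yhat)
bias_hat = (N / len(s2)) * sum(yhat[k] - y[k] for k in s2)
Y_corrected = Y_model - bias_hat
```

Because the units in s2 were never used for fitting, the estimated bias is (conditionally on the fitted model) a design-unbiased estimate of the model total's error, so subtracting it removes the bias at the cost of extra sampling variance.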