Page 133 - Contributed Paper Session (CPS) - Volume 4
CPS2156 Luis Sanguiao Sande
implementation. Fortunately, for bootstrap-aggregated [1] algorithms (such as
random forests) and simple sampling designs, we get an approximate version
based on the out-of-bag predictions.
Theorem B: Let $\hat{y}_k = \hat{f}(s)(x_k)$ be the predictions when $k \notin s$ and the
out-of-bag predictions when $k \in s$. Let $e_k$ be the out-of-bag errors. Under a
simple (possibly stratified) design, the estimator
is an approximation of an unbiased second stage based estimator for .
Proof. See [5].
In this way the estimator is expressed as the sum of a purely model-based
expression and a sampling-based estimate of the bias. This result also
supports the view that the out-of-bag errors are a good indicator of the
performance of a bootstrap-aggregated algorithm. An unbiased estimator of
the variance is also known.
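The decomposition into a model-based part and a sampling-based bias estimate can be sketched in code. This is only an illustration under stated assumptions: the estimator is assumed to be the sum of the predictions over the whole population plus N/n times the sum of the out-of-bag errors over the sample, under simple random sampling without replacement (no strata). A bagged least-squares line stands in for a random forest. The names fit_line and oob_corrected_total are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line through (xs, ys); a constant if xs has no spread."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return lambda x: my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def oob_corrected_total(pop_x, sample_idx, sample_y, n_bags=200, seed=0):
    """Return (model_part, bias_part, estimate) for the ASSUMED estimator form."""
    rng = random.Random(seed)
    N, n = len(pop_x), len(sample_idx)
    xs = [pop_x[k] for k in sample_idx]
    position = {k: i for i, k in enumerate(sample_idx)}  # unit -> place in sample
    pred_sum, pred_cnt = [0.0] * N, [0] * N

    for _ in range(n_bags):
        draw = [rng.randrange(n) for _ in range(n)]  # one bootstrap replicate
        model = fit_line([xs[i] for i in draw], [sample_y[i] for i in draw])
        in_bag = set(draw)
        for k in range(N):
            # A sampled unit is predicted only by replicates that left it out.
            if k in position and position[k] in in_bag:
                continue
            pred_sum[k] += model(pop_x[k])
            pred_cnt[k] += 1

    if min(pred_cnt) == 0:
        raise ValueError("some sampled unit was never out of bag; raise n_bags")
    y_hat = [s / c for s, c in zip(pred_sum, pred_cnt)]

    model_part = sum(y_hat)  # purely model-based total
    oob_errors = [sample_y[i] - y_hat[k] for i, k in enumerate(sample_idx)]
    bias_part = N / n * sum(oob_errors)  # expansion estimate of the bias
    return model_part, bias_part, model_part + bias_part


# Usage on a small synthetic population (values are illustrative only).
rng = random.Random(1)
pop_x = [float(k) for k in range(100)]
pop_y = [2.0 * x + 1.0 + rng.gauss(0.0, 5.0) for x in pop_x]
sample_idx = rng.sample(range(100), 20)
model_part, bias_part, estimate = oob_corrected_total(
    pop_x, sample_idx, [pop_y[k] for k in sample_idx])
```

With a random forest, the out-of-bag predictions would come from the forest itself instead of the explicit loop over bootstrap replicates used here.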
Proposition C: Let $\hat{V}_1$ be a second-stage unbiased estimator for the
variance of $\hat{Y}_1$. An unbiased estimator of the variance of the
estimator is $\hat{V}_2$.
Proof. See [5].
Note that if we want to build the estimator $\hat{V}_2$ we have to be able to
estimate the variance of $\hat{Y}_1$. Thus a measurable sampling design is
required at the second stage of our two-stage decomposition. In the examples
only one unit is sampled in the second stage, so there is no way to build the
estimator. To estimate the variance, we would have to take $n - 2$ elements at
stage one and 2 elements at stage two. For random forests, out-of-bag
predictions excluding two elements would be needed, but we do not know of any
software that provides such predictions.
3. Result
We compare the estimator from Theorem B with the pure model-based
estimator in two very different populations. The first is based on
synthetic data constructed to satisfy a (noisy) equality, and the second
is based on real data (and therefore satisfies no such equality). In both
cases a large number of samples (10000) is drawn, and the estimates are
compared with the true (known) aggregated values of the target variables.
The bias is estimated as the mean of the 10000 estimates minus the true
total of the target variable. The variance is estimated as the variance of
the 10000 estimates, and the mean squared error is estimated as the mean of
the squared differences between each estimate and the true total.
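The three summary measures described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the function names are hypothetical, a plain expansion estimator stands in for the estimators being compared, and the population form of the variance is an assumed choice (the text does not say which form was used).

```python
import random
import statistics


def monte_carlo_summary(estimates, true_total):
    """Bias, variance and mean squared error of repeated estimates."""
    estimates = list(estimates)
    bias = statistics.fmean(estimates) - true_total
    # Population form (divide by the number of replicates), an assumed choice;
    # with it, mse == variance + bias**2 holds exactly.
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - true_total) ** 2 for e in estimates])
    return bias, variance, mse


def expansion_estimator(y, sample_idx):
    """Stand-in estimator: N/n times the sample sum."""
    return len(y) / len(sample_idx) * sum(y[k] for k in sample_idx)


def simulate(y, n, estimator, n_rep=10000, seed=0):
    """Draw n_rep simple random samples of size n and summarise the estimates."""
    rng = random.Random(seed)
    units = range(len(y))
    estimates = [estimator(y, rng.sample(units, n)) for _ in range(n_rep)]
    return monte_carlo_summary(estimates, sum(y))


# Usage: a small known population, so the true total is available.
population = [float(k) for k in range(1, 51)]
bias, variance, mse = simulate(population, n=10, estimator=expansion_estimator)
```

Replacing expansion_estimator with any function of the same signature lets the same loop compare several estimators on identical terms.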
122 | ISI WSC 2019