Page 23 - Contributed Paper Session (CPS) - Volume 7
CPS2020 Honeylet T. S.
Aside from Poisson regression via Maximum Likelihood Estimation (MLE),
bootstrap methods will also be considered in the estimation of model
parameters. Efron (2000) argued that an advantage of bootstrapping is its
broad applicability. Moreover, Efron and Tibshirani (1986) noted that
bootstrap methods can be used to measure the statistical accuracy of
estimators, even those with complicated forms. In a sampling experiment, they
showed that bootstrap estimates of the standard error of the correlation
coefficient, which does not have a simple form, are nearly unbiased.
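As a rough illustration of the bootstrap idea described above, the following Python sketch estimates the standard error of a sample correlation coefficient by resampling pairs with replacement. It is a minimal sketch, not the procedure used in this study: the function names, the number of bootstrap replicates, and the toy data are all assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_se_correlation(x, y, n_boot=1000, seed=0):
    """Nonparametric bootstrap standard error of the correlation coefficient.

    Pairs (x_i, y_i) are resampled with replacement, and the standard
    deviation of the replicated correlations is returned.
    n_boot and seed are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)
    n = len(x)
    replicates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        replicates.append(pearson_r(xb, yb))
    mean_rep = sum(replicates) / n_boot
    var_rep = sum((r - mean_rep) ** 2 for r in replicates) / (n_boot - 1)
    return math.sqrt(var_rep)


# Hypothetical toy data: y depends linearly on x plus noise.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(50)]
y = [0.6 * a + data_rng.gauss(0, 1) for a in x]
se = bootstrap_se_correlation(x, y)
```

The same resampling loop could, in principle, wrap any estimator (for example, a fitted regression coefficient) in place of `pearson_r`.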
The true model may involve a count response variable predicted by two
variables, $X_1$ and $X_2$, that are either highly or weakly correlated. In
practice, however, only one of these variables may have been observed in the
data sources. Hence, simulations in this study will focus on cases in which
only one of the common variables is available in the data sources. Other
considerations that may arise in practice will also be taken into account.
These include the total sample size of the concatenated data sources, the
ratio of each data source's size to the total sample size, and the effect of
$X_1$ and $X_2$ on the count variables to be matched.
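To make these simulation factors concrete, the sketch below generates one simulated pair of data sources in Python. It is only an assumed data-generating process, not the design used in this study: the predictors are drawn as standard normal variables with correlation `rho`, the count response is Poisson with a log link, and the names `simulate_sources`, `ratio_a`, and `beta` are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method.

    Adequate for the small means used here; not suited to very large lam.
    """
    limit = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_sources(n_total, ratio_a, rho, beta, seed=0):
    """Simulate (x1, x2, y) records and split them into two disjoint sources.

    n_total : total sample size of the concatenated sources
    ratio_a : assumed share of the total sample assigned to Data Source A
    rho     : correlation between the two predictors
    beta    : (intercept, effect of x1, effect of x2) on the log mean count
    """
    rng = random.Random(seed)
    n_a = round(n_total * ratio_a)
    records = []
    for _ in range(n_total):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        x1 = z1
        # Mixing z1 and z2 gives corr(x1, x2) = rho with unit variances.
        x2 = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        lam = math.exp(beta[0] + beta[1] * x1 + beta[2] * x2)
        records.append((x1, x2, poisson_draw(rng, lam)))
    return records[:n_a], records[n_a:]


# Hypothetical setting: 200 units, 30% in Source A, highly correlated predictors.
source_a, source_b = simulate_sources(200, 0.3, 0.8, (0.5, 0.3, 0.2))
```

Varying `n_total`, `ratio_a`, `rho`, and `beta` over a grid would give one scenario per combination of the factors listed above.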
The objective of this study is to combine data from different sources
through statistical matching techniques with the end goal of developing a
count regression model. Specifically, this study aims to: (1) develop a statistical
matching technique to create synthetic count data, (2) estimate count
regression models based on synthetic data, and (3) characterize the estimation
procedure through simulation studies.
The matching and estimation procedure will be evaluated using absolute
relative bias (RBIAS) and mean absolute error (MAE). RBIAS will measure the
accuracy of the estimates obtained while MAE will measure the predictive
ability of the estimated model.
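The two evaluation measures can be sketched in Python as follows. The exact formulas are not stated in this section, so the definitions below are assumptions: RBIAS is taken as the absolute difference between the mean estimate and the true parameter value, relative to the true value and expressed as a percentage, and MAE is the average absolute difference between observed and predicted counts.

```python
def absolute_relative_bias(estimates, true_value):
    """Absolute relative bias (in percent) of replicated estimates.

    Assumed definition: |mean(estimates) - true_value| / |true_value| * 100.
    """
    mean_estimate = sum(estimates) / len(estimates)
    return abs((mean_estimate - true_value) / true_value) * 100


def mean_absolute_error(observed, predicted):
    """Mean absolute error between observed and predicted responses."""
    total = sum(abs(o - p) for o, p in zip(observed, predicted))
    return total / len(observed)


# Hypothetical values for illustration only.
rbias = absolute_relative_bias([0.9, 1.1, 1.3], true_value=1.0)
mae = mean_absolute_error([3, 0, 2], [2.5, 0.5, 2.0])
```

Under these definitions, a smaller RBIAS indicates more accurate parameter estimates and a smaller MAE indicates better predictions.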
2. Methodology
The statistical matching problem involves integrating different data sources to
create a synthetic dataset. For the purpose of this study, two independent data
sources – Data Source A with $n_A$ observations and Data Source B with $n_B$
observations – will be considered. A common variable is observed in both data
sources, while one specific variable is missing in Data Source A and another
specific variable is missing in Data Source B. The data sources are random
samples from the same population. The assumption is that combining these two
independent data sources will yield a larger random sample, Data Source A ∪
B, with $n = n_A + n_B$ observations from the same population. Consequently, the
observation units in Data Source A and Data Source B are disjoint.
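The structure of the concatenated file can be illustrated with a short Python sketch. The variable names `x` (common), `y` (observed only in Source A), and `z` (observed only in Source B) are hypothetical labels chosen for illustration, as is the function name `concatenate_sources`. Values that a source does not observe are recorded as `None`.

```python
def concatenate_sources(source_a, source_b):
    """Stack two disjoint samples into one file.

    Each record is a dict of variable name -> value. Variables that a
    source does not observe are filled with None, and a 'source' label
    records where each unit came from.
    """
    all_vars = sorted(set().union(*(rec.keys() for rec in source_a + source_b)))
    combined = []
    for label, source in (("A", source_a), ("B", source_b)):
        for rec in source:
            row = {var: rec.get(var) for var in all_vars}
            row["source"] = label
            combined.append(row)
    return combined


# Hypothetical records: Source A lacks z, Source B lacks y.
source_a = [{"x": 1.2, "y": 3}, {"x": 0.4, "y": 0}]
source_b = [{"x": -0.7, "z": 5}]
combined = concatenate_sources(source_a, source_b)
```

Because the units are disjoint, the combined file has as many rows as the two sources together, and the matching step would then aim to fill the `None` entries.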
12 | ISI WSC 2019