Page 367 - Contributed Paper Session (CPS) - Volume 6
CPS1969 Janna M. De Veyra
2. Methodology
In this paper, two data sources were matched. Source A consists of
variables 1, 2 and y, while source B consists of variables 1, 2 and z.
The goal is to test whether y and z are independent. Variables 1 and 2,
which are common to both sources, are essential in performing the test.
The following steps illustrate how the test is conducted:
Step 1: Declare z as missing in source A and y as missing in source B.
Step 2: In source A, identify the relationship of y to variables 1 and 2.
Do the same in source B for z.
Step 3: Impute the missing values based on the values of 1 and 2 and on
the relationship between the missing variable and 1 and 2 obtained in Step 2.
Step 4: Combine the two data sources so that the total sample size is
n = n_A + n_B, where n_A and n_B are the sample sizes of sources A and B.
Step 5: Compute the chi-square statistic of y and z in the combined data
source to test for their independence.
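The steps above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the code used in the paper: records are assumed to be plain tuples, the names v1 and v2 stand in for the common variables 1 and 2, and impute_z and impute_y are hypothetical placeholders for whichever matching method from Section 2.1 is applied in Step 3.

```python
from collections import Counter


def combine_sources(source_a, source_b, impute_z, impute_y):
    """Steps 1, 3 and 4: fill in the missing variable in each source, then stack.

    source_a holds (v1, v2, y) records and source_b holds (v1, v2, z) records.
    impute_z and impute_y are hypothetical callables taking (v1, v2).
    Returns a list of (v1, v2, y, z) records of length len(source_a) + len(source_b).
    """
    combined = []
    for v1, v2, y in source_a:          # z is missing in source A
        combined.append((v1, v2, y, impute_z(v1, v2)))
    for v1, v2, z in source_b:          # y is missing in source B
        combined.append((v1, v2, impute_y(v1, v2), z))
    return combined


def chi_square_statistic(y, z):
    """Step 5: Pearson chi-square statistic and degrees of freedom for y versus z."""
    if len(y) != len(z):
        raise ValueError("y and z must have the same length")
    n = len(y)
    joint = Counter(zip(y, z))          # observed cell counts
    row = Counter(y)                    # marginal counts of y
    col = Counter(z)                    # marginal counts of z
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            observed = joint.get((r, c), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof
```

After combining, the y and z columns of the stacked records are passed to chi_square_statistic, and the resulting statistic is compared with the chi-square critical value for the returned degrees of freedom.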
2.1 Matching Methods
This section discusses the matching procedures that were considered
in deriving the synthetic data set.
2.1.1 Regression Imputation
Logistic regression is used to match two data sources when the variables
of interest are categorical. Imputed values are based on the probability
of success. When the variable of interest is binary, the missing value is
imputed as 1 if the probability of success obtained from the logistic
regression is greater than or equal to 0.5, and as 0 otherwise. When the
variable of interest has more than two categories, the probability of each
category is computed from the log-odds model and the missing value is
imputed as the category with the highest probability.
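The imputation rule described above can be illustrated as follows. This is a minimal sketch that assumes the regression coefficients have already been estimated; the function names, the coefficient layout, and the use of a reference category with all-zero coefficients are assumptions made for the example and are not taken from the paper.

```python
import math


def sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def impute_binary(intercept, coefs, x):
    """Impute 1 when the fitted success probability is at least 0.5, else 0."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1 if sigmoid(eta) >= 0.5 else 0


def impute_multicategory(coef_by_category, x):
    """Impute the category with the highest fitted probability.

    coef_by_category maps each category to (intercept, coefs); the reference
    category is assumed to carry all-zero coefficients.
    Returns the imputed category and the dictionary of category probabilities.
    """
    scores = {
        k: b0 + sum(b * v for b, v in zip(bs, x))
        for k, (b0, bs) in coef_by_category.items()
    }
    top = max(scores.values())                      # subtract max for stability
    weights = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(weights.values())
    probs = {k: w / total for k, w in weights.items()}
    return max(probs, key=probs.get), probs
```

Here x holds the values of the common variables for the record being imputed. Note that a probability of exactly 0.5 is imputed as 1, matching the "greater than or equal to" rule in the text.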
2.1.2 Markov Chain Monte Carlo
This procedure was used for parameter estimation in the logistic model.
In this paper, MCMC samples are drawn from the posterior distribution of
a logistic regression model using a random walk Metropolis algorithm.
The algorithm is iterative: it produces a Markov chain and permits
empirical estimation of posterior distributions.
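A random walk Metropolis sampler for a logistic regression posterior can be sketched as below. The paper does not state the prior, the proposal scale, or the chain length, so the independent normal prior (prior_sd), the Gaussian proposal (step), and the defaults used here are assumptions chosen only to make the example self-contained.

```python
import math
import random


def log_posterior(beta, X, y, prior_sd=10.0):
    """Logistic log-likelihood plus an assumed independent N(0, prior_sd^2) prior."""
    lp = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        # log p(y | eta) = y * eta - log(1 + exp(eta)), computed stably
        lp += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return lp


def random_walk_metropolis(X, y, n_iter=2000, step=0.3, seed=0):
    """Return the chain of coefficient draws and the acceptance rate."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])            # start the chain at zero
    current = log_posterior(beta, X, y)
    draws, accepted = [], 0
    for _ in range(n_iter):
        # Symmetric Gaussian proposal centred on the current state
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
            accepted += 1
        draws.append(list(beta))        # a rejected step repeats the old state
    return draws, accepted / n_iter
```

Each row of X is assumed to include a leading 1 for the intercept. Posterior summaries such as means are then estimated empirically from the draws, usually after discarding an initial burn-in portion of the chain.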
2.1.3 Stochastic Imputation
This procedure uses the estimates from the regression imputation and the
Markov chain Monte Carlo procedure to impute missing values by adding a
random error term to the model. The imputation equation can be written as
356 | ISI WSC 2019