Page 367 - Contributed Paper Session (CPS) - Volume 6

CPS1969 Janna M. De Veyra
            2.  Methodology
                In this paper, two data sources were matched: source A, composed of
            the variables x1, x2 and y, and source B, composed of the variables x1, x2
            and z. The goal is to test for the independence of y and z. The variables
            common to both sources, x1 and x2, are essential in performing the test.
            The following steps illustrate how the test is conducted:
            Step 1: Declare the z variable as missing in source A and the y variable as
            missing in source B.
            Step 2: In source A, identify the relationship of the y variable to the
            common variables. Do the same for the z variable in source B.
            Step 3: Impute the missing values based on the values of the common
            variables and on the relationship obtained in Step 2.
            Step 4: Combine the two data sources, so that the total sample size is
            n = n_A + n_B, where n_A and n_B are the sample sizes of sources A and B.
            Step 5: Compute the chi-square statistic of y and z in the combined data
            source to test for their independence.
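Steps 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's own code: the function name `chi_square_statistic` and the two small completed sources are hypothetical, and the records are assumed to have already gone through Steps 1 to 3, so every record carries both a y and a z value.

```python
from collections import Counter


def chi_square_statistic(pairs):
    """Pearson chi-square statistic and degrees of freedom for (y, z) pairs."""
    n = len(pairs)
    observed = Counter(pairs)                 # cell counts of the y-by-z table
    y_totals = Counter(y for y, _ in pairs)   # row totals
    z_totals = Counter(z for _, z in pairs)   # column totals
    stat = 0.0
    for y_val, y_count in y_totals.items():
        for z_val, z_count in z_totals.items():
            expected = y_count * z_count / n  # expected count under independence
            stat += (observed[(y_val, z_val)] - expected) ** 2 / expected
    df = (len(y_totals) - 1) * (len(z_totals) - 1)
    return stat, df


# Hypothetical completed sources: each record is (y, z) after imputation.
source_a = [(0, 0)] * 6 + [(0, 1)] * 12 + [(1, 0)] * 14 + [(1, 1)] * 18  # n_A = 50
source_b = [(0, 0)] * 4 + [(0, 1)] * 8 + [(1, 0)] * 16 + [(1, 1)] * 22   # n_B = 50

# Step 4: stack the two sources, so n = n_A + n_B.
combined = source_a + source_b

# Step 5: chi-square statistic of y and z in the combined source.
stat, df = chi_square_statistic(combined)
print(f"n = {len(combined)}, chi-square = {stat:.4f}, df = {df}")
```

The statistic is then compared with the chi-square distribution on `df` degrees of freedom; for a 2 x 2 table at the 5% level the critical value is 3.841.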

            2.1 Matching Methods
               This part discusses the matching procedures that were considered in
            deriving the synthetic data.
            2.1.1 Regression Imputation
               Logistic regression is used to match two data sources when the
            variables of interest are categorical. Imputed values are based on the
            probability of success. When the variable of interest is binary, the
            missing value is imputed as 1 when the probability of success obtained
            from the logistic regression is greater than or equal to 0.5, and as 0
            otherwise. When the variable of interest has more than two categories,
            the probability of each category is computed from the log-odds model and
            the missing value is imputed as the category with the highest probability.
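The two imputation rules can be illustrated as follows. This is a sketch under stated assumptions: the coefficients, the category labels and all function names are hypothetical, and the log-odds of each category are assumed to be expressed relative to a baseline category whose log-odds is 0. In practice the coefficients would come from a logistic regression fitted on the source where the variable is observed.

```python
import math


def success_probability(x1, x2, coef):
    """Probability of success from a fitted logistic model (intercept, b1, b2)."""
    b0, b1, b2 = coef
    eta = b0 + b1 * x1 + b2 * x2          # linear predictor (log-odds of success)
    return 1.0 / (1.0 + math.exp(-eta))


def impute_binary(p_success, threshold=0.5):
    """Binary rule: impute 1 when the probability is at least 0.5, else 0."""
    return 1 if p_success >= threshold else 0


def category_probabilities(log_odds):
    """Turn log-odds relative to a baseline category into probabilities."""
    denom = sum(math.exp(v) for v in log_odds.values())
    return {cat: math.exp(v) / denom for cat, v in log_odds.items()}


def impute_multicategory(log_odds):
    """Multi-category rule: impute the category with the highest probability."""
    probs = category_probabilities(log_odds)
    return max(probs, key=probs.get)


# Hypothetical fitted coefficients (intercept, x1, x2), for illustration only.
coef = (-0.5, 1.0, 0.8)
imputed_1 = impute_binary(success_probability(1, 0, coef))  # log-odds = 0.5
imputed_0 = impute_binary(success_probability(0, 0, coef))  # log-odds = -0.5

# Hypothetical log-odds for a three-category variable; "low" is the baseline.
log_odds = {"low": 0.0, "medium": 0.7, "high": -0.2}
imputed_cat = impute_multicategory(log_odds)
print(imputed_1, imputed_0, imputed_cat)
```

Because the rule always picks the most probable value, records with the same x1 and x2 receive the same imputed value.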
            2.1.2 Markov Chain Monte Carlo
               This procedure was applied to parameter estimation for the logistic
            distribution. In this paper, the MCMC sample is drawn from the posterior
            distribution of a logistic regression model using a random walk Metropolis
            algorithm. This iterative algorithm produces a Markov chain and permits
            empirical estimation of posterior distributions.
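A random walk Metropolis sampler for a logistic regression posterior can be sketched as below. Several choices here are assumptions, not taken from the paper: the independent normal prior with standard deviation `prior_sd`, the Gaussian proposal with scale `step`, the burn-in length, and the small synthetic data set. The paper does not state its prior or tuning settings.

```python
import math
import random


def log_posterior(beta, X, y, prior_sd=10.0):
    """Log posterior (up to a constant): Bernoulli-logit likelihood + normal prior."""
    log_lik = 0.0
    for row, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        # y*eta - log(1 + exp(eta)), written in a numerically stable form
        log_lik += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    log_prior = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)  # assumed prior
    return log_lik + log_prior


def random_walk_metropolis(X, y, n_iter=5000, step=0.3, seed=0):
    """Return the chain of coefficient draws and the acceptance rate."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    current = log_posterior(beta, X, y)
    draws, accepted = [], 0
    for _ in range(n_iter):
        # Symmetric proposal: current value plus Gaussian noise.
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, X, y)
        diff = candidate - current
        # Accept with probability min(1, posterior ratio).
        if diff >= 0 or rng.random() < math.exp(diff):
            beta, current = proposal, candidate
            accepted += 1
        draws.append(beta)
    return draws, accepted / n_iter


# Small synthetic data set (hypothetical): intercept column plus one covariate.
x_values = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0] * 4
y_values = [0, 0, 0, 1, 0, 1, 0, 1, 1] * 4
X = [[1.0, x] for x in x_values]

draws, rate = random_walk_metropolis(X, y_values, n_iter=2000)
kept = draws[500:]  # discard an assumed burn-in of 500 draws
posterior_mean = [sum(d[j] for d in kept) / len(kept) for j in range(2)]
print("acceptance rate:", round(rate, 2), "posterior mean:", posterior_mean)
```

The retained draws give an empirical estimate of the posterior distribution; their means can serve as coefficient estimates, and `step` would normally be tuned so that the acceptance rate is neither close to 0 nor close to 1.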
            2.1.3 Stochastic Imputation
               This procedure uses the estimates from the regression imputation and the
            Markov chain Monte Carlo procedure to impute missing values by adding a
            random error to the model. The imputation equation can be written as

                                                               356 | ISI WSC 2019