Page 358 - Contributed Paper Session (CPS) - Volume 6

CPS1966 Jessa L. S. C. et al.
                  will be proposed, the performance of the models above on an actual insurance
                  data set will be assessed, and a scheme to match product type to client profile
                  will be proposed.

                  2.  Methodology
                      The data consists of 2,758 actual records of policies sold by a
                  bancassurance company from 2014 to 2015. The data reflects values as of the
                  time of extraction; hence, changes to variables since policy effectivity were
                  not tracked. In practice, a client may purchase more than one plan. To ensure
                  uniqueness of the observations, only the first policy purchase was kept in the
                  data set. The first purchase also provides information on the priority of the
                  client in purchasing an insurance plan.
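                  The first-purchase rule described above can be sketched as follows. This
                  is a minimal illustration in Python rather than the R used in the study,
                  and the field names (client_id, effective_date, product_type) and the
                  sample records are hypothetical, not taken from the actual data set.

```python
from datetime import date


def keep_first_policy(records):
    """Return one record per client: the policy with the earliest effective date.

    Field names are hypothetical; ties keep the record encountered first.
    """
    first = {}
    for rec in records:
        cid = rec["client_id"]
        if cid not in first or rec["effective_date"] < first[cid]["effective_date"]:
            first[cid] = rec
    # Sort by date only to make the output order deterministic.
    return sorted(first.values(), key=lambda r: r["effective_date"])


# Made-up example: client A bought two plans, client B bought one.
policies = [
    {"client_id": "A", "effective_date": date(2014, 3, 1), "product_type": "Endowment"},
    {"client_id": "A", "effective_date": date(2015, 1, 15), "product_type": "Unit Link"},
    {"client_id": "B", "effective_date": date(2014, 7, 9), "product_type": "Unit Link"},
]
first_policies = keep_first_policy(policies)
```

                  Here first_policies holds one record per client, so each client
                  contributes exactly one observation to the modeling data.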
                      There are seventeen (17) independent variables, namely Payment Mode,
                  Life Insurance Coverage in PhP, Status Indicator, Age of Insured, Age of Policy
                  Owner, Relationship Category, Owner Gender Indicator, Insured Gender
                  Indicator, Owner Income, No. of Sick Family Members, Medical History, Insured
                  Height, Insured Weight, X-ray Indicator, and indicators for attached Accident,
                  Critical Illness, and Hospitalization Riders. The dependent variable is the Type
                  of Insurance (Unit Link, Endowment, and Other Traditional). These variables
                  are of several types: Nominal Categorical, Dichotomous, Continuous
                  Numerical, and Discrete Numerical. They capture demographic, behavioral,
                  and economic information about the insurance clients.
                      No variable considered had missing data in more than 18% of the
                  observations. For variables with missing data, an imputation method
                  specific to each model was carried out: surrogate splits for CART,
                  Multivariate Imputation by Chained Equations (MICE) for MLR, and Random
                  Forest Imputation for RF. These imputations and all subsequent evaluations
                  were done in R.
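                  The missing-data screen mentioned above can be illustrated with a short
                  check. This is only a sketch in Python (the study itself used R); the
                  variable names and the toy rows are hypothetical, and the model-specific
                  imputation methods are not reproduced here.

```python
def missing_fraction(records, variables):
    """Fraction of records in which each variable is missing (None or absent)."""
    n = len(records)
    return {v: sum(1 for r in records if r.get(v) is None) / n for v in variables}


def variables_over_threshold(records, variables, threshold=0.18):
    """Names of variables whose missing fraction exceeds the threshold."""
    fractions = missing_fraction(records, variables)
    return sorted(v for v, f in fractions.items() if f > threshold)


# Made-up rows: each variable is missing in one of six records.
rows = [
    {"owner_income": 50000, "insured_height": 160},
    {"owner_income": None, "insured_height": 172},
    {"owner_income": 42000, "insured_height": 165},
    {"owner_income": 61000, "insured_height": None},
    {"owner_income": 38000, "insured_height": 158},
    {"owner_income": 45000, "insured_height": 170},
]
names = ["owner_income", "insured_height"]
fractions = missing_fraction(rows, names)
flagged = variables_over_threshold(rows, names)
```

                  In this toy example each variable is missing in one of six rows, which
                  is below the 18% mark, so no variable is flagged.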
                      In the modeling process, observations from 2014 were treated as in-time
                  observations, while those from 2015 were tagged as off-time. The 2014 data
                  was further divided into training and validation sets. The former was used in
                  the development of the model, while the latter was used to fine-tune the model
                  to avoid overfitting. For all models, the classes of the dependent variable were
                  balanced and distributed randomly between the training and validation data sets.
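                  The partitioning described above can be sketched as follows. This is an
                  illustrative Python version, not the authors' R code. It assumes that
                  "balanced" means each product type is split in the same proportion
                  (a stratified split); the field names (year, product_type) and the toy
                  records are hypothetical.

```python
import random
from collections import defaultdict


def split_in_time_off_time(records, in_year=2014, off_year=2015):
    """Separate in-time records (model building) from off-time records."""
    in_time = [r for r in records if r["year"] == in_year]
    off_time = [r for r in records if r["year"] == off_year]
    return in_time, off_time


def stratified_split(records, train_fraction, seed=0):
    """Randomly split records into training and validation sets, class by class.

    Assumption: stratifying on product_type stands in for the paper's
    'balanced and distributed randomly' description.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[r["product_type"]].append(r)
    train, valid = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid


# Made-up data: 10 records per product type per year.
types = ["Unit Link", "Endowment", "Other Traditional"]
records = [
    {"year": year, "product_type": t, "id": f"{year}-{t}-{i}"}
    for year in (2014, 2015)
    for t in types
    for i in range(10)
]
in_time, off_time = split_in_time_off_time(records)
train, valid = stratified_split(in_time, train_fraction=0.8)
```

                  Calling stratified_split with train_fraction set to 0.8, 0.7, 0.6, or 0.5
                  gives the kind of alternative training-validation partitions that can then
                  be compared; the off-time records stay untouched until final assessment.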
                      To address potential overfitting, the 2014 data was divided using four
                  training-validation splits: 80%-20%, 70%-30%, 60%-40%, and 50%-50%. Since
                  CART performs cross-validation, the splits were translated to 2 to 5 folds
                  and prediction errors were computed. The two other models were fitted on the
                  training and validation data. For MLR and RF, the discrepancy between the
                  training and validation values of the residual deviance and of the out-of-bag
                  error, respectively, was computed.
                  The split that generated the smallest discrepancy in the computed values was

                                                                     347 | ISI WSC 2019