Page 358 - Contributed Paper Session (CPS) - Volume 6
CPS1966 Jessa L. S. C. et al.
will be proposed, the performance of the models above on an actual insurance
data set will be assessed, and a scheme to match product type to client profile
will be proposed.
2. Methodology
The data comprise 2,758 actual records of policies sold by a
bancassurance company from 2014 to 2015. The data are current as of the time
of extraction; hence, changes in variables since policy effectivity were not
tracked. In an actual setting, it is possible for a client to purchase more
than one plan. To ensure uniqueness of the observations, only the first policy
purchase was kept in the data set. The first purchase also provides
information on the priority of the client in purchasing an insurance plan.
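The first-purchase filter described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code; the field names `client_id`, `effectivity_date`, and `plan`, as well as the sample records, are hypothetical, since the paper does not describe the layout of the data.

```python
from datetime import date

def keep_first_purchase(records):
    """Return one record per client: the policy with the earliest effectivity date."""
    first = {}
    for rec in records:
        cid = rec["client_id"]  # hypothetical client identifier
        # Keep the record if the client is new or this policy is earlier.
        if cid not in first or rec["effectivity_date"] < first[cid]["effectivity_date"]:
            first[cid] = rec
    return list(first.values())

# Hypothetical records, for illustration only.
policies = [
    {"client_id": "A", "effectivity_date": date(2014, 6, 1), "plan": "Endowment"},
    {"client_id": "A", "effectivity_date": date(2014, 2, 15), "plan": "Unit Link"},
    {"client_id": "B", "effectivity_date": date(2015, 1, 10), "plan": "Other Traditional"},
]
unique_policies = keep_first_purchase(policies)
```

Here client A's later Endowment policy is dropped, so each client contributes exactly one observation.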
There are seventeen (17) independent variables, namely: Payment Mode,
Life Insurance Coverage in PhP, Status Indicator, Age of Insured, Age of Policy
Owner, Relationship Category, Owner Gender Indicator, Insured Gender
Indicator, Owner Income, No. of Sick Family Members, Medical History, Insured
Height, Insured Weight, X-ray Indicator, and Indicators for Attached Accident,
Critical Illness, and Hospitalization Riders. The dependent variable is the Type
of Insurance (Unit Link, Endowment, and Other Traditional). These variables
are of several types: Nominal Categorical, Dichotomous, Continuous
Numerical, and Discrete Numerical. They also capture demographic,
behavioral, and economic information about the insurance clients.
No variable considered had missing data in more than 18% of
the observations. For those with missing data, an imputation method
specific to each model was carried out. For CART, MLR, and RF, these were
surrogate splits, Multivariate Imputation by Chained Equations (MICE), and
Random Forest Imputation, respectively. These imputations and all subsequent
evaluations were done in R.
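As a rough illustration of the missing-data check mentioned above, the share of missing values per variable could be computed as in the sketch below. It is written in Python rather than the authors' R; the variable names and sample rows are hypothetical, and the 0.18 cut-off only mirrors the figure reported above and is not a rule stated by the authors. The model-specific imputations themselves are not reproduced here.

```python
def missing_share(records, variables):
    """Fraction of records in which each variable is missing (None)."""
    n = len(records)
    return {v: sum(r.get(v) is None for r in records) / n for v in variables}

# Hypothetical rows, for illustration only.
rows = [
    {"owner_income": 50000, "insured_height": 165},
    {"owner_income": None, "insured_height": 170},
    {"owner_income": 42000, "insured_height": 158},
    {"owner_income": 61000, "insured_height": 163},
    {"owner_income": 38000, "insured_height": 172},
]
shares = missing_share(rows, ["owner_income", "insured_height"])

# Flag variables whose missing share stays within the assumed 18% limit.
within_limit = {v: s <= 0.18 for v, s in shares.items()}
```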
In the modeling process, observations from 2014 were treated as in-time
observations while those from 2015 were tagged as off-time. The 2014 data
were further divided into training and validation sets. The former was used in
the development of the model while the latter was used to fine-tune the model
to avoid overfitting. For all models, the dependent variable was balanced
and distributed randomly between the training and validation data sets.
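The partitioning described above can be sketched as follows. This Python sketch is not the authors' R code, and it assumes that "balanced and distributed randomly" means a stratified random split that preserves the class proportions of the dependent variable; that reading, the field names `year` and `type`, and the synthetic records are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def split_in_off_time(records):
    """Separate 2014 (in-time) records from 2015 (off-time) records."""
    in_time = [r for r in records if r["year"] == 2014]
    off_time = [r for r in records if r["year"] == 2015]
    return in_time, off_time

def stratified_split(records, train_frac, seed=0):
    """Randomly split records so each class keeps roughly the same train share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[r["type"]].append(r)
    train, valid = [], []
    for group in by_class.values():
        rng.shuffle(group)  # random assignment within each class
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid

# Synthetic records, for illustration only.
types = ["Unit Link", "Endowment", "Other Traditional"]
data = [
    {"year": 2015 if i % 4 == 0 else 2014, "type": types[i % 3]}
    for i in range(120)
]
in_time, off_time = split_in_off_time(data)
train, valid = stratified_split(in_time, train_frac=0.8)
```

The off-time records are held out entirely, while the in-time records are divided into training and validation sets within each insurance type.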
To address potential overfitting, the 2014 data were grouped into four
splits: 80%-20%, 70%-30%, 60%-40%, and 50%-50%, which represent the
proportions of training and validation data. Since CART performs
cross-validation, the splits were translated to 2 to 5 folds and prediction
errors were computed. The two other models were fitted on the training and
validation data. The discrepancy between the training and validation residual
deviances for MLR, and between the training and validation out-of-bag errors
for RF, was computed.
The split that generated the smallest discrepancy in the computed values was
347 | I S I W S C 2 0 1 9