Page 354 - Special Topic Session (STS) - Volume 2

STS498 Wei-Yin L.
                  predictor variables to fit a model to P(Y = 1). Some of the variables are listed
                  in Table 3.
                  Table 3: Some of the 109 predictor variables in the crash-test dataset. Angular
                  variables PDOF and IMPANG are measured in degrees clockwise (from -179 to 180),
                  with 0 being the front of the car.
                   Name    Description                          Variable type    Percent missing
                   MAKE    Vehicle manufacturer                 71 categories                  0
                   MODEL   Vehicle model                        642 categories                 0
                   YEAR    Vehicle model year                   continuous                 0.001
                   BODY    Vehicle body type                    19 categories                  0
                   ENGINE  Engine type                          18 categories                  0
                   ENGDSP  Engine displacement                  continuous                 0.007
                   TRANSM  Transmission type                    9 categories               0.002
                   VEHTWT  Vehicle test weight                  continuous                 0.001
                   VEHWID  Vehicle width                        continuous                 0.027
                   VEHCG   Vehicle CG distance from front axle  continuous                 0.024
                   COLMEC  Steering column collapse mechanism   9 categories               0.076
                   VEHSPD  Speed of vehicle before impact       continuous                     0
                   PDOF    Principal direction of force         continuous                 0.007
                   TKSURF  Test track surface                   5 categories               0.024
                   TKCOND  Test track condition                 6 categories               0.024
                   IMPANG  Impact angle                         continuous                     0
                   OCCTYP  Occupant type                        13 categories                  0
                   DUMSIZ  Dummy size percentile                8 categories                   0
                   SEPOSN  Seat position                        6 categories               0.025
                   BARRIG  Rigid or deformable barrier          3 categories                   0
                   BARSHP  Barrier shape                        21 categories                  0
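
The angular convention in the Table 3 caption (degrees clockwise, from -179 to 180, with 0 at the front of the car) can be illustrated with a small sketch. It assumes, purely for illustration, that a raw angle might arrive on some other scale such as 0 to 360; the function name `wrap_degrees` is hypothetical and is not taken from the paper or the dataset.

```python
def wrap_degrees(angle):
    """Map an angle in degrees onto the interval (-180, 180]."""
    # Shift, reduce modulo 360, and shift back: result lies in [-180, 180).
    wrapped = (angle + 180.0) % 360.0 - 180.0
    # Move the lower endpoint to +180 so the range matches -179..180.
    return 180.0 if wrapped == -180.0 else wrapped


# 270 degrees clockwise from the front is the same direction as -90.
left_side = wrap_degrees(270)
# Directly behind the car maps to 180 rather than -180.
rear = wrap_degrees(-180)
```

Under this convention, 0 is a frontal direction, 180 is the rear, and negative values lie counterclockwise from the front.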

                     One thousand two hundred and eleven of the records are missing one or
                  more data values. Therefore a linear logistic regression using all the variables
                  can be fitted only to the subset of 14,730 records that have complete values.
                  After transforming each categorical variable into a set of indicator variables,
                  the model has 561 regression coefficients, including the constant term. All but
                  six variables (ENGINE, VEHWID, TKCOND, IMPANG, RSTTYP, and BARRIG) are
                  statistically significant. But the regression coefficients in the model cannot be
                  relied upon to explain how each variable affects p = P(Y = 1). For example,
                  although VEHSPD is highly significant in this model, it is not significant in a
                  simple linear logistic model that employs it as the only predictor. This is an
                  example of Simpson’s paradox. It occurs when a variable has an effect in the
                  same direction within subsets of the data, but when the subsets are combined,
                  the effect vanishes or reverses in direction.
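
The vanishing effect described above can be reproduced with a toy example. The sketch below is not based on the crash-test data: the counts are invented, the two strata "A" and "B" are hypothetical, and the predictor x is binary rather than continuous like VEHSPD. It relies on the fact that, for a single binary predictor, the fitted slope of a simple linear logistic regression equals the log odds ratio of the corresponding 2x2 table, so no model-fitting library is needed.

```python
import math


def log_odds_ratio(events_x1, n_x1, events_x0, n_x0):
    """Log odds ratio of Y = 1 for x = 1 versus x = 0.

    For one binary predictor this equals the slope of a simple
    linear logistic regression of Y on x.
    """
    odds_x1 = events_x1 / (n_x1 - events_x1)
    odds_x0 = events_x0 / (n_x0 - events_x0)
    return math.log(odds_x1 / odds_x0)


# Invented counts (number with Y = 1, group size) for two hypothetical strata.
strata = {
    "A": {"x1": (60, 300), "x0": (10, 100)},   # low-risk stratum, mostly x = 1
    "B": {"x1": (70, 100), "x0": (120, 300)},  # high-risk stratum, mostly x = 0
}

# Effect of x within each stratum.
within = {name: log_odds_ratio(*s["x1"], *s["x0"]) for name, s in strata.items()}


def pool(key):
    """Add up events and group sizes over the strata, ignoring the stratum label."""
    events = sum(s[key][0] for s in strata.values())
    n = sum(s[key][1] for s in strata.values())
    return events, n


# Effect of x when the strata are combined.
pooled = log_odds_ratio(*pool("x1"), *pool("x0"))

print(within, pooled)
```

In this toy setup the log odds ratio is positive in both strata, yet the pooled value is zero, because x = 1 is concentrated in the low-risk stratum. A model that omits the stratum variable would therefore show no effect of x, even though x raises the odds of Y = 1 within each stratum.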
                     Figure 2 shows the GUIDE logistic regression tree model, where a single
                  predictor variable is used to fit a simple linear logistic regression model in each


                                                                     343 | ISI WSC 2019