Page 351 - Special Topic Session (STS) - Volume 2

STS498 Wei-Yin L.





                                       The GUIDE approach
                                            Wei-Yin Loh
                                 University of Wisconsin, Madison, WI, USA

            Abstract
            GUIDE is an algorithm for constructing classification and regression tree
            models. The talk discusses the basic problem of accuracy versus
            interpretability and the specific features that the GUIDE approach offers. Its
            unique features are demonstrated through two real examples.

            Keywords
            Classification and regression trees; interpretable models; missing values;
            prediction accuracy

            1.  Introduction
                There is a general sense that machine learning models with high prediction
            accuracy are not interpretable. A large part of this belief is due to methods
            that are inherently uninterpretable to start with, but which can be tuned to
            achieve arbitrarily high prediction accuracy on the same data that are
            used for their construction. When prediction accuracy is instead estimated
            from independent test samples, studies have shown that there is typically no
            method with uniformly highest prediction accuracy (Lim et al., 2000).
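
            The gap between accuracy on the construction data and accuracy on an
            independent test sample can be seen in a small simulation. The sketch below
            is an illustration only and is not part of GUIDE: it assumes a
            1-nearest-neighbour classifier as a stand-in for a method that can fit its
            own training data perfectly, and it uses simulated labels that are pure
            noise. All names (one_nn_predict, accuracy, resub_acc, test_acc) are
            hypothetical.

```python
import random

def one_nn_predict(train_x, train_y, x):
    # Return the label of the closest training point (1-nearest neighbour).
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

def accuracy(train_x, train_y, xs, ys):
    # Fraction of points in (xs, ys) whose label is predicted correctly.
    hits = sum(one_nn_predict(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)

rng = random.Random(0)

# Assumption: labels are pure noise, so x carries no information about y.
train_x = [rng.random() for _ in range(200)]
train_y = [rng.randint(0, 1) for _ in range(200)]
test_x = [rng.random() for _ in range(2000)]
test_y = [rng.randint(0, 1) for _ in range(2000)]

# Accuracy measured on the same data used to build the classifier.
resub_acc = accuracy(train_x, train_y, train_x, train_y)
# Accuracy measured on an independent test sample.
test_acc = accuracy(train_x, train_y, test_x, test_y)

print(f"training-data accuracy: {resub_acc:.3f}")
print(f"test-sample accuracy:   {test_acc:.3f}")
```

            Each training point is its own nearest neighbour, so the accuracy on the
            construction data is perfect even though there is nothing to learn; the
            test-sample accuracy stays near that of random guessing.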
                Classification and regression tree models are inherently interpretable but
            are often regarded as having lower prediction accuracy than other methods
            such as neural nets. This talk reviews the GUIDE algorithm (Loh, 2002, 2009,
            2014), which has special features for enhancing interpretability, such as
            unbiased variable selection. GUIDE also stands apart from many other tree
            algorithms in being able to deal seamlessly with missing data, a practical
            difficulty that most machine learning methods are not designed to handle. Two
            examples are given below to illustrate these features.
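
            As background on why unbiased variable selection matters for
            interpretability, the following sketch simulates the selection bias of a
            greedy exhaustive split search. It does not implement GUIDE or its
            selection procedure; it assumes a simple regression setting in which the
            response is independent of both predictors, and all names (best_split_gain,
            share_cont) are hypothetical. With no real signal, an unbiased procedure
            would pick each predictor about half the time.

```python
import random

def best_split_gain(x, y):
    # Largest reduction in the sum of squared errors over all splits "x <= c".
    n = len(y)
    total = sum(y)
    pairs = sorted(zip(x, y))
    best, left_sum = 0.0, 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between tied x values
        right_sum = total - left_sum
        # SSE reduction = between-group sum of squares for this split.
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i) - total ** 2 / n
        best = max(best, gain)
    return best

rng = random.Random(1)
n, reps, wins = 50, 500, 0
for _ in range(reps):
    # Assumption: y is independent of both predictors (no true signal).
    y = [rng.gauss(0, 1) for _ in range(n)]
    x_binary = [rng.randint(0, 1) for _ in range(n)]  # one candidate split
    x_cont = [rng.random() for _ in range(n)]         # n - 1 candidate splits
    if best_split_gain(x_cont, y) > best_split_gain(x_binary, y):
        wins += 1

# Share of replications in which the continuous predictor is selected.
share_cont = wins / reps
print(f"continuous predictor selected in {share_cont:.0%} of replications")
```

            The continuous predictor offers many more candidate splits, so the greedy
            search selects it far more often than half the time even though neither
            predictor is related to the response. A tree built this way can therefore
            suggest that a variable is important when it is not.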

            2.  Consumer expenditure data
                The data come from the 2013 Consumer Expenditure Survey of the U.S.
            Bureau of Labor Statistics. They contain information on the expenditures,
            incomes, and characteristics of 2838 consumer units (CUs), drawn from more
            than 600 survey questions. Table 1 gives the names and percentages of missing
            values of some of the variables. Details on the survey may be found in Bureau
            of Labor Statistics (2016, Chap. 6). We use INTRDVX as the dependent variable;
            it is the amount of interest and dividend income received by the CU during
            the past


                                                               340 | ISI WSC 2019