Page 351 - Special Topic Session (STS) - Volume 2
STS498 Wei-Yin L.
The GUIDE approach
Wei-Yin Loh
University of Wisconsin, Madison, WI, USA
Abstract
GUIDE is an algorithm for constructing classification and regression tree
models. The talk discusses the basic problem of accuracy versus
interpretability and the specific features that the GUIDE approach offers. Its
unique features are demonstrated through two real examples.
Keywords
Classification and regression trees; interpretable models; missing values;
prediction accuracy
1. Introduction
There is a general sense that machine learning models with high prediction
accuracy are not interpretable. A large part of this belief is due to methods
that are inherently uninterpretable to start with, but which can be tuned to
achieve arbitrarily high levels of prediction accuracy on the same data that are
used for their construction. When prediction accuracy is instead estimated
from independent test samples, studies have shown that typically no single
method has uniformly highest prediction accuracy (Lim et al., 2000).
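The gap between accuracy measured on the construction data and accuracy
measured on an independent test sample can be illustrated with a minimal
sketch. Everything in it is an assumption made for illustration and none of it
comes from the paper: the data are synthetic, the labels are flipped with
probability 0.2, and a 1-nearest-neighbor rule stands in for a method that can
be tuned to fit its own training data perfectly.

```python
import random


def make_data(n, rng, noise=0.2):
    """Synthetic two-class data (illustrative assumption, not from the paper).

    The true label is 1 when x1 + x2 > 1; each label is then flipped
    with probability `noise`.
    """
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = int(x[0] + x[1] > 1.0)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(
        train,
        key=lambda row: (row[0][0] - x[0]) ** 2 + (row[0][1] - x[1]) ** 2,
    )
    return nearest[1]


def accuracy(train, sample):
    """Proportion of points in `sample` classified correctly by 1-NN on `train`."""
    hits = sum(predict_1nn(train, x) == y for x, y in sample)
    return hits / len(sample)


rng = random.Random(0)
train = make_data(200, rng)
test = make_data(200, rng)

# Accuracy on the same data used to build the rule: every training
# point is its own nearest neighbor, so the rule reproduces its label.
train_acc = accuracy(train, train)

# Accuracy on an independent sample from the same distribution.
test_acc = accuracy(train, test)

print(f"training accuracy: {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
```

In this sketch the training accuracy is perfect by construction, whereas the
test accuracy is limited by the label noise, which is why comparisons between
methods are more meaningful when based on independent test samples.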
Classification and regression tree models are inherently interpretable but
are often regarded as not having as high prediction accuracy as other methods
such as neural nets. This talk reviews the GUIDE (Loh, 2002, 2009, 2014)
algorithm which has special features for enhancing interpretability, such as
unbiased variable selection. GUIDE also stands apart from many other tree
algorithms in being able to seamlessly deal with missing data, a practical
difficulty that most machine learning methods are not designed for. Two
examples are given below to illustrate these features.
2. Consumer expenditure data
The data come from the 2013 Consumer Expenditure Survey of the U.S.
Bureau of Labor Statistics. They contain the responses of 2838 consumer units
(CUs) to more than 600 questions on expenditures, incomes, and CU
characteristics. Table 1 gives the names and percentages of missing values in
some of the variables. Details on the survey may be found in Bureau of Labor
Statistics (2016, Chap. 6). We use INTRDVX as the dependent variable; it is the
amount of interest and dividend income received by the CU during the past
340 | ISI WSC 2019