Page 87 - Contributed Paper Session (CPS) - Volume 2

CPS1440 Avner Bar-Hen et al.
            cells defined by the partition are a realization of a spatial point process, where
            the response variable Y can be seen as a mark of the point of position given
            by X. Following this idea, the variant of the CART algorithm that we propose
            takes into account the spatial dimension of the marked processes, and thus
            makes it possible to provide the user with a partition of the plane defining
            homogeneous areas of interaction between the two marks.
                While  a  classical  assumption  for  CART  is  to  consider  a  sample  of  i.i.d.
            observations, our variant takes into account spatial dependence through the
            quantification of the interaction between the two considered populations, i.e.
            the link between the labels of the points. Indeed, although CART is often
            identified with the most commonly used and implemented version, presented
            in [3], there are several ways to build CART-type decision trees, by changing
            the family of admissible splits, the cost function, or the stopping rule. We
            consider here the spatial dependence using Ripley's intertype K-function to
            suitably redefine
            the splitting criterion in CART. Instead of being based only on the respective
            proportions of the marks, the splits will be selected according to the strength
            of interaction between the two populations.
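                As a rough illustration of the interaction measure involved (a sketch, not
            the authors' implementation), Ripley's intertype K-function K_12(r) counts, up
            to scaling, the type-2 points within distance r of a typical type-1 point; under
            independence of the two patterns it is close to πr². The function name, the
            unit-square window, and the absence of edge correction below are all
            simplifying assumptions:

```python
import numpy as np

def intertype_K(pts1, pts2, r, area):
    """Naive estimator of Ripley's intertype K-function K_12(r):
    (area / (n1 * n2)) times the number of intertype pairs within
    distance r. No edge correction is applied in this sketch."""
    pts1 = np.asarray(pts1, dtype=float)
    pts2 = np.asarray(pts2, dtype=float)
    n1, n2 = len(pts1), len(pts2)
    # pairwise distances between the two point patterns, shape (n1, n2)
    d = np.linalg.norm(pts1[:, None, :] - pts2[None, :, :], axis=2)
    return area * np.count_nonzero(d <= r) / (n1 * n2)

# Two independent uniform patterns in the unit square: the estimate
# should be close to pi * r**2 (slightly below, without edge correction).
rng = np.random.default_rng(0)
a = rng.uniform(0, 1, size=(500, 2))  # type-1 points
b = rng.uniform(0, 1, size=(500, 2))  # type-2 points
print(intertype_K(a, b, r=0.1, area=1.0))
```

            Values well above πr² over a range of distances r would indicate attraction
            between the two populations, values below it repulsion; the proposed splitting
            criterion exploits this contrast rather than the raw mark proportions.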
                Among the various extensions of CART, Bel et al. [2] proposed a variant
            for spatial data, but in the regression context. They consider the observations
            as a sample of regionalized variables, which can be modeled as random fields
            with spatial dependence, and propose various reweighting schemes to
            equalize the contribution of each surface unit in the choice of splits. This
            approach, however, is not adapted to the classification case.

            2.  Methodology

            a.  Classical Classification And Regression Trees (CART)
                The data are considered as an independent sample of the random
            variables (X_1, …, X_p, Y), where the X_k's are the explanatory variables and
            Y is the categorical response variable. CART is a rule-based method that
            generates a
            binary tree through recursive partitioning that splits a subset (called a node)
            of  the  data  set  into  two  subsets  (called  sub-nodes)  according  to  the
            minimization  of  a  heterogeneity  criterion  computed  on  the  resulting  sub-
            nodes.  Each  split  involves  a  single  variable.  Some  variables  may  be  used
            several times while others may not be used at all.
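                The split selection described above can be sketched in a few lines (a toy
            illustration under the Gini heterogeneity index, not the authors' code): for
            each variable and each candidate threshold, compute the size-weighted
            heterogeneity of the two sub-nodes and keep the minimizer.

```python
import numpy as np

def gini(y):
    """Gini heterogeneity index: 0 when the node contains a single class."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search for the single-variable split X[:, j] <= t that
    minimizes the size-weighted Gini index of the two sub-nodes."""
    n, p = X.shape
    best = (np.inf, None, None)  # (criterion, variable index, threshold)
    for j in range(p):
        # every observed value except the maximum is a candidate threshold
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            crit = (left.sum() * gini(y[left])
                    + (~left).sum() * gini(y[~left])) / n
            if crit < best[0]:
                best = (crit, j, t)
    return best

# Tiny example: the split x_1 <= 1 separates the two classes exactly,
# so the weighted Gini criterion of the best split is 0.
X = np.array([[0., 1.], [1., 1.], [2., 0.], [3., 0.]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))
```

            The spatial variant of the paper replaces this mark-proportion criterion with
            one built on the intertype K-function, but the recursive structure of the
            algorithm is unchanged.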
                Let us consider a decision tree T. When Y is a categorical variable, a class
            label is assigned to each terminal node (or leaf) of T. Hence T can be viewed
            as a mapping that assigns a predicted label ŷ = T̂(x_1, …, x_p) to each
            observation.
                The growing step leading to a deep maximal tree is obtained by recursive
            partitioning of the training sample by selecting the best split at each node
            according to some heterogeneity index, such that it is equal to 0 when there
            is only one class represented in the node to be split, and is maximum when all
                                                                 76 | ISI WSC 2019