Page 227 - Contributed Paper Session (CPS) - Volume 4

CPS2203 Thierry D. et al.





                            Model Selection for Covariates Clustering
                                 Thierry Dumont, Ana Karina Fermin
                                        Université Paris Nanterre

            Abstract
            In this talk we study the problem of inference in a high-dimensional setting
            where each variable is assumed to be Boolean. We present a method to
            select an appropriate partition of the variables such that variables that are not
            grouped together are assumed to be independent.

            Keywords
            Clustering;  high-dimensionality;  covariate  partition;  thresholding;  model
            selection; slope heuristic; binary data; MovieLens dataset.

            1.  Introduction
                In many fields of application the data scientist has to handle a large number of
            covariates that may depend on each other. Examples include social
            networks, the Netflix problem, and metagenomic surveys.
                We consider here the estimation problem of the law of $p$ binary covariates.
            Even in the simple case of binary covariates, $2^p - 1$ parameters are needed to
            describe their distribution, therefore inference becomes quickly intractable
            when $p$ increases.
                In this paper we focus on this binary case and propose a method to
            overcome this curse of dimensionality phenomenon. We propose to project the
            data onto low-dimensional models. The covariates are clustered into $K$ blocks such
            that covariates belonging to different blocks are assumed to be independent. Let
            $p_1, \ldots, p_K$ be the number of covariates in each block; such a clustering involves
            a number of parameters equal to $\sum_{k=1}^{K} (2^{p_k} - 1)$.
                As an illustration, if each block contains one variable only, then $K = p$, the
            $p_k$'s are all equal to 1 and the number of parameters associated with this
            clustering equals $p$. This is the simplest model. At the other extreme, if all the
            covariates belong to the same block, $K = 1$ and $p_1 = p$. This corresponds to
            the most expensive model with a number of parameters equal to $2^p - 1$.
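
The parameter count above can be checked numerically. The following is a minimal sketch, not part of the original paper: the function name `n_parameters` and the example block sizes are hypothetical and chosen only for illustration. It evaluates the sum of $2^{p_k} - 1$ over the blocks of a partition.

```python
from typing import Sequence


def n_parameters(block_sizes: Sequence[int]) -> int:
    """Number of free parameters when binary covariates are split into
    mutually independent blocks of the given sizes (hypothetical helper).

    A block of size p_k has 2**p_k possible binary patterns, hence
    2**p_k - 1 free probabilities; independent blocks simply add up.
    """
    if any(size < 1 for size in block_sizes):
        raise ValueError("every block must contain at least one covariate")
    return sum(2 ** size - 1 for size in block_sizes)


p = 10  # illustrative number of covariates (an assumption, not from the paper)

simplest = n_parameters([1] * p)        # K = p, every p_k equals 1
most_expensive = n_parameters([p])      # K = 1, p_1 = p
intermediate = n_parameters([3, 3, 4])  # an arbitrary 3-block partition

print(simplest, most_expensive, intermediate)
```

With $p = 10$, the fully independent model needs 10 parameters and the single-block model needs 1023, while the intermediate partition with blocks of sizes 3, 3 and 4 needs 7 + 7 + 15 = 29.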
                                                                             
                To any such partition into blocks, we can associate a maximum likelihood
            estimate of the density s* of the observations. We aim at finding a good
            covariate partition.
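
To make the estimate attached to a partition concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It relies on the standard fact that, when blocks are independent, the maximum likelihood estimate factorizes into a product of empirical pattern frequencies, one table per block. The names `fit_block_mle` and `log_likelihood`, and the toy data, are hypothetical.

```python
from collections import Counter
from math import log
from typing import Dict, List, Sequence, Tuple

Row = Sequence[int]           # one observation of p binary covariates
Partition = List[List[int]]   # each block is a list of column indices
Table = Dict[Tuple[int, ...], float]


def fit_block_mle(data: Sequence[Row], partition: Partition) -> List[Table]:
    """For each block, estimate the probability of every observed binary
    pattern by its empirical frequency (the per-block multinomial MLE)."""
    n = len(data)
    tables: List[Table] = []
    for block in partition:
        counts = Counter(tuple(row[j] for j in block) for row in data)
        tables.append({pattern: c / n for pattern, c in counts.items()})
    return tables


def log_likelihood(data: Sequence[Row], partition: Partition,
                   tables: List[Table]) -> float:
    """Log-likelihood of the data under the product-of-blocks estimate.
    Only valid for rows whose block patterns were seen during fitting."""
    total = 0.0
    for row in data:
        for block, table in zip(partition, tables):
            total += log(table[tuple(row[j] for j in block)])
    return total


# Toy data (an assumption): columns 0 and 1 always agree, column 2 is free.
data = [(0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]

grouped = [[0, 1], [2]]      # keeps the dependent pair together
singletons = [[0], [1], [2]]  # treats all covariates as independent

ll_grouped = log_likelihood(data, grouped, fit_block_mle(data, grouped))
ll_singletons = log_likelihood(data, singletons, fit_block_mle(data, singletons))
print(ll_grouped, ll_singletons)
```

On this toy sample the partition that groups the two dependent columns reaches a higher log-likelihood than the fully independent one, at the cost of more parameters (4 instead of 3). This trade-off is what a partition selection procedure has to arbitrate.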
                We propose a statistical view of this partition selection problem. Since the
            number of possible partitions grows exponentially with the number of
            covariates, we use a procedure based on the empirical correlation matrix to


                                                               216 | I S I   W S C   2 0 1 9