Page 229 - Contributed Paper Session (CPS) - Volume 4

CPS2203 Thierry D. et al.
            basis of this sample we are interested in estimating the distribution $p^\star$ from
            the data when the dimension $p$ is large.
                For this purpose, on the basis of the observations we consider, for each
            $m \in \mathcal{M}$, an estimator $\hat{p}_m$ of the density $p^\star$, where $\hat{p}_m$ belongs to the
            model $\mathcal{F}_m$. One wishes to select among the family of estimators
            $(\hat{p}_m)_{m \in \mathcal{M}}$ the one that realizes the
            "best" compromise between data adjustment and the corresponding model's
            complexity. However, such a selection requires an exhaustive exploration of all
            possible partitions, which is intractable.
                We propose in Section 3.1 a method to considerably restrict the set of all
            partitions $\mathcal{M}$ to a set $\hat{\mathcal{M}}$ that naturally arises from thresholding the empirical
            correlation matrix. Section 3.2 provides a criterion for choosing among
            the candidate densities $(\hat{p}_m)_{m \in \hat{\mathcal{M}}}$.

            3.1 Data-driven partition set
                Let $C = (c(i,j))_{p \times p}$ be the empirical $p \times p$ correlation matrix associated
            with the data $y = (y_1, y_2, \ldots, y_n)$ with $y_i \in \{0,1\}^p$. For any $i$ and $j$ in $\{1, \ldots, p\}$,
            $c(i,j) = \mathrm{cor}(y^{(i)}, y^{(j)})$.
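                Let $\lambda \in [0,1)$ be a threshold. To define the partition $m_\lambda$ associated with $\lambda$, first define
            the adjacency matrix $A_\lambda = (a_\lambda(i,j))_{(i,j) \in \{1,\ldots,p\} \times \{1,\ldots,p\}}$, with
            $a_\lambda(i,j) = \mathbf{1}_{|c(i,j)| > \lambda}$.
                Let $G_\lambda = (V, E_\lambda)$ be the graph associated with $A_\lambda$, where the vertices (nodes)
            $V$ represent the covariates and $E_\lambda$ is the collection of undirected edges. For two
            nodes $i$ and $j$, $\{i,j\} \in E_\lambda$ if $a_\lambda(i,j) = 1$, that is, if $|c(i,j)| > \lambda$. The partition $m_\lambda$
            corresponds to the connected components of this graph. Note that if $\lambda$ lies
            between two consecutive values of the $|c(i,j)|$'s, $c_1 \le \lambda < c_2$, then the partition
            $m_\lambda$ is the same as $m_{c_1}$.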
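
                The construction above can be sketched in a few lines of Python. This is only an
            illustration under stated assumptions, not the authors' implementation: the function
            names `empirical_correlation` and `threshold_partition` are hypothetical, the data are
            assumed to be a list of binary rows, and a constant column is assigned correlation 0
            with every other column (a convention chosen here, not taken from the paper). The
            connected components are found with a small union-find structure.

```python
from math import sqrt


def empirical_correlation(data):
    """Pearson correlation matrix of the columns of `data` (one row per observation).

    Assumption of this sketch: a constant column has correlation 0 with everything.
    """
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    norms = [sqrt(sum(r[j] ** 2 for r in centered)) for j in range(p)]
    corr = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(r[i] * r[j] for r in centered)
                corr[i][j] = dot / (norms[i] * norms[j])
    return corr


def threshold_partition(corr, lam):
    """Partition of {0, ..., p-1} into connected components of the graph
    that has an edge {i, j} whenever |corr[i][j]| > lam."""
    p = len(corr)
    parent = list(range(p))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(corr[i][j]) > lam:
                parent[find(i)] = find(j)

    blocks = {}
    for i in range(p):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())


# Toy correlation matrix for three covariates (made-up values).
toy_corr = [[1.0, 0.9, 0.1],
            [0.9, 1.0, 0.2],
            [0.1, 0.2, 1.0]]
print(threshold_partition(toy_corr, 0.5))  # [[0, 1], [2]]
```

                Indices are 0-based here, whereas the text numbers the covariates from 1. Any other
            connected-components routine would serve equally well; union-find is used only because
            it keeps the sketch short.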
              
                Therefore if $\Lambda = \{|c(i,j)| : |c(i,j)| < 1\} \cup \{0\}$, we consider $\hat{\mathcal{M}} = \{m_\lambda\}_{\lambda \in \Lambda}$ the resulting
            collection of partitions. It is straightforward to show that $|\hat{\mathcal{M}}| \le p$.
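
                As a rough illustration of how the data-driven collection of partitions could be
            enumerated, the sketch below scans every threshold taken from the absolute correlations
            strictly below 1, together with 0, and keeps the distinct partitions it meets. The names
            `components` and `candidate_partitions` are hypothetical, and the correlation matrix is
            assumed to be given as a list of lists. Since lowering the threshold can only merge
            blocks, the partitions are nested, which is why their number cannot exceed the number of
            covariates.

```python
def components(corr, lam):
    """Connected components of the graph with an edge {i, j} when |corr[i][j]| > lam,
    returned as a tuple of sorted tuples (hashable, so usable as a dict key)."""
    p = len(corr)
    seen, blocks = set(), []
    for start in range(p):
        if start in seen:
            continue
        stack, block = [start], []
        seen.add(start)
        while stack:  # depth-first search from `start`
            i = stack.pop()
            block.append(i)
            for j in range(p):
                if j != i and j not in seen and abs(corr[i][j]) > lam:
                    seen.add(j)
                    stack.append(j)
        blocks.append(tuple(sorted(block)))
    return tuple(blocks)


def candidate_partitions(corr):
    """Map each distinct partition to the smallest threshold that produces it."""
    p = len(corr)
    thresholds = {0.0} | {
        abs(corr[i][j])
        for i in range(p)
        for j in range(i + 1, p)
        if abs(corr[i][j]) < 1
    }
    collection = {}
    for lam in sorted(thresholds):
        collection.setdefault(components(corr, lam), lam)
    return collection


# Toy correlation matrix for three covariates (made-up values).
toy_corr = [[1.0, 0.9, 0.1],
            [0.9, 1.0, 0.2],
            [0.1, 0.2, 1.0]]
parts = candidate_partitions(toy_corr)
print(len(parts) <= len(toy_corr))  # True
```

                On this toy matrix the thresholds 0 and 0.1 give the same single-block partition, so
            four thresholds yield only three distinct partitions.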
            Model Selection for binary Variable Clustering

            3.2 Data-driven partition selection
                Let $\hat{\mathcal{M}}$, a subset of $\mathcal{M}$, be a random collection of partitions that is
            $(y_1, \ldots, y_n)$-measurable. We wish to select among $\hat{\mathcal{M}}$ a model $\hat{m}$ that allows
            theoretical bounds on the distance between the true density $p^\star$ and $\hat{p}_{\hat{m}}$. We
            propose a penalized version of the maximum likelihood estimator, defined as
            follows.
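                Let $\hat{m}$ be defined as

                $$\hat{m} = \operatorname*{argmin}_{m \in \hat{\mathcal{M}}} \Big\{ -\frac{1}{n} \sum_{i=1}^{n} \log\big(\hat{p}_m(y_i)\big) + \mathrm{pen}(m) \Big\} \qquad (3.1)$$
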
            where, if m = {B1,B2,...,BK},
                                                               218 | ISI WSC 2019