Page 232 - Contributed Paper Session (CPS) - Volume 4

CPS2203 Thierry D. et al.
                  interpretation of our results when the data consist of movie ratings, in order
                  to motivate the practical use of the statistical tool presented in this paper.

                  4.1 Simulated data set
                      To study the performance of our procedure on synthetic data, we choose
                  a large number of covariates ($p = 100$). We randomly assign these covariates
                  to blocks $\{B_1^*, \dots, B_{K^*}^*\}$ whose sizes $d_k^*$, $k = 1, \dots, K^*$,
                  vary from 1 to 7. Then, for each $k \in \{1, \dots, K^*\}$, the distribution
                  $P_k^\star$ of $(X_j)_{j \in B_k^*}$ is drawn uniformly in the set of
                  probability measures on $\{0,1\}^{d_k^*}$. The synthetic data sets manipulated
                  in this section are samples of the product measure
                  $P^\star = \prod_{k=1}^{K^*} P_k^\star$. Figures 1 and 2 present the performance
                  depending on the sample size $n$. For each considered size $n$, 200 independent
                  $n$-samples are drawn from $P^\star$. Our procedure is applied to each sample,
                  providing a partition of the covariates $\{B_1, \dots, B_K\}$ and an estimator
                  $\hat{P} = \prod_{k=1}^{K} \hat{P}_k$ of $P^\star$. Figure 1 displays, for each
                  sample size $n$, the boxplots and the empirical mean of $h^2(P^\star, \hat{P})$
                  over the 200 experiments. Figure 1 illustrates the convergence of $\hat{P}$
                  towards $P^\star$ in terms of the Hellinger distance.
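
                  The simulation design described above can be sketched in a few lines of
                  Python. This is only an illustrative sketch under stated assumptions, not
                  the procedure of the paper: the function names are hypothetical, block sizes
                  are assumed to be drawn uniformly in {1, ..., 7} one after the other, the
                  uniform draw on the set of probability measures is implemented as a flat
                  Dirichlet draw, the squared Hellinger distance is taken with the convention
                  h^2(P, Q) = 1 - sum_x sqrt(P(x) Q(x)), and the estimator is the empirical
                  distribution computed on the true blocks (an oracle partition), which stands
                  in for the model selection step.

```python
import numpy as np

def random_partition(p, max_size, rng):
    """Cut a random permutation of range(p) into blocks of size 1..max_size."""
    perm = rng.permutation(p)
    blocks, start = [], 0
    while start < p:
        size = int(rng.integers(1, max_size + 1))  # assumed size-drawing scheme
        blocks.append(perm[start:start + size])    # last block may be truncated
        start += size
    return blocks

def draw_block_laws(blocks, rng):
    """One law per block, uniform on the simplex over {0,1}^d (flat Dirichlet)."""
    return [rng.dirichlet(np.ones(2 ** len(b))) for b in blocks]

def sample(n, p, blocks, laws, rng):
    """Draw n observations from the product of the block laws."""
    X = np.zeros((n, p), dtype=np.int8)
    for b, law in zip(blocks, laws):
        states = rng.choice(len(law), size=n, p=law)
        # Decode each state index into len(b) binary coordinates.
        X[:, b] = (states[:, None] >> np.arange(len(b))) & 1
    return X

def empirical_laws(X, blocks):
    """Empirical law of each block (oracle partition: an assumption of this sketch)."""
    laws = []
    for b in blocks:
        states = X[:, b].astype(np.int64) @ (1 << np.arange(len(b)))
        counts = np.bincount(states, minlength=2 ** len(b))
        laws.append(counts / len(X))
    return laws

def hellinger2_product(laws_p, laws_q):
    """Squared Hellinger distance between two products over the SAME blocks.

    The Hellinger affinity of product measures factorizes over the blocks.
    """
    affinity = 1.0
    for a, b in zip(laws_p, laws_q):
        affinity *= np.sum(np.sqrt(a * b))
    return 1.0 - affinity

rng = np.random.default_rng(0)
blocks = random_partition(100, 7, rng)
laws = draw_block_laws(blocks, rng)
for n in (100, 1000, 10000):
    X = sample(n, 100, blocks, laws, rng)
    print(n, hellinger2_product(laws, empirical_laws(X, blocks)))
```

                  The factorization used in hellinger2_product only holds when both measures
                  are products over the same partition. Comparing the true measure with an
                  estimator built on a different partition requires merging blocks first, which
                  this sketch does not attempt.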
                  Model Selection for binary Variable Clustering








                                           Figure 3: MovieLens group sizes












                            Figure 4: Titles and genres of the largest group of movies selected






                                                                     221 | ISI WSC 2019