Page 232 - Contributed Paper Session (CPS) - Volume 4
CPS2203 Thierry D. et al.
interpretation of our results when the data consist of movie ratings, in order
to motivate the practical use of the statistical tool presented in this paper.
4.1 Simulated data set
To study the performance of our procedure on synthetic data, we choose
a large number of covariates ($p = 100$). We randomly assign these covariates
to blocks $\{B_1^*, \ldots, B_{K^*}^*\}$ whose sizes $d_k^*$, $k = 1, \ldots, K^*$,
vary from 1 to 7. Then, for each $k \in \{1, \ldots, K^*\}$, the distribution
$P_k^*$ of $(X_j)_{j \in B_k^*}$ is drawn uniformly in the set of probability
measures on $\{0,1\}^{d_k^*}$. The synthetic data sets manipulated in this
section are samples of the product measure $P^\star = \prod_{k=1}^{K^*} P_k^*$.
Figures 1 and 2 present the performance depending on the sample size $n$.
For each considered size $n$, 200 independent $n$-samples are drawn from
$P^\star$. Our procedure is applied to each sample, providing a partition of
the covariates $\{B_1, \ldots, B_K\}$ and an estimator
$\hat{P} = \prod_{k=1}^{K} \hat{P}_k$ of $P^\star$. Figure 1 displays, for each
sample size, the boxplots and the empirical mean of $h^2(P^\star, \hat{P})$
over the 200 experiments. Figure 1 illustrates the convergence of $\hat{P}$
towards $P^\star$ in terms of the Hellinger distance.
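As an illustration, the simulation design described above can be sketched in
Python as follows. This is a minimal sketch under stated assumptions, not the
code used for the figures. The function names are hypothetical. The text does
not say how the block sizes are drawn, so the sketch assumes they are drawn
uniformly in {1, ..., 7} with the last block truncated to fit p. The uniform
draw on the set of probability measures on {0,1}^d is implemented as a
Dirichlet(1, ..., 1) draw over the 2^d configurations. The squared Hellinger
distance uses the convention h^2(P, Q) = 1 - sum_x sqrt(P(x) Q(x)), which is
also an assumption, and it only compares two laws defined on the same block.

```python
import numpy as np


def random_blocks(p, max_size, rng):
    """Randomly partition covariates 0..p-1 into blocks of size 1..max_size.

    Assumption: sizes are uniform on {1, ..., max_size}; the last block is
    truncated so that the sizes sum to p.
    """
    perm = rng.permutation(p)
    blocks, start = [], 0
    while start < p:
        size = min(int(rng.integers(1, max_size + 1)), p - start)
        blocks.append(np.sort(perm[start:start + size]))
        start += size
    return blocks


def random_block_laws(blocks, rng):
    """Draw one law per block, uniformly on the simplex over {0,1}^d."""
    return [rng.dirichlet(np.ones(2 ** len(b))) for b in blocks]


def sample(n, p, blocks, laws, rng):
    """Draw an n-sample from the product of the block laws."""
    X = np.zeros((n, p), dtype=np.int8)
    for block, law in zip(blocks, laws):
        d = len(block)
        # Each configuration of {0,1}^d is encoded as an integer in [0, 2^d).
        codes = rng.choice(2 ** d, size=n, p=law)
        # Decode the integer into its d binary digits.
        X[:, block] = (codes[:, None] >> np.arange(d)) & 1
    return X


def squared_hellinger(p_law, q_law):
    """Squared Hellinger distance between two laws on the same finite set."""
    return max(0.0, 1.0 - float(np.sum(np.sqrt(p_law * q_law))))


rng = np.random.default_rng(0)
blocks = random_blocks(p=100, max_size=7, rng=rng)
laws = random_block_laws(blocks, rng)
X = sample(n=200, p=100, blocks=blocks, laws=laws, rng=rng)
```

Repeating the last line 200 times for each sample size n would give the
independent n-samples on which a clustering procedure can then be evaluated.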
Model Selection for binary Variable Clustering
Figure 3: MovieLens group sizes
Figure 4: Titles and genres of the largest selected group of movies
221 | ISI WSC 2019