Page 233 - Contributed Paper Session (CPS) - Volume 4
CPS2203 Thierry D. et al.
Figure 2 shows boxplots of the proportion of covariates whose
estimated block coincides with one of the true blocks. This graphic allows us
to appreciate the quality of the model selection side of the procedure.
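The proportion summarised in these boxplots can be sketched as follows. This is a minimal illustration, assuming that an estimated block "corresponds" to a true block when the two contain exactly the same covariates; the function name `proportion_recovered` and the toy partitions are hypothetical and are not taken from the paper.

```python
def proportion_recovered(estimated_partition, true_partition):
    """Fraction of covariates whose estimated block is exactly a true block.

    Both arguments are lists of blocks, each block being a collection of
    covariate labels. Exact set equality is an assumed reading of
    "corresponds to one of the true blocks".
    """
    true_blocks = {frozenset(block) for block in true_partition}
    n_covariates = sum(len(block) for block in estimated_partition)
    n_recovered = sum(
        len(block)
        for block in estimated_partition
        if frozenset(block) in true_blocks
    )
    return n_recovered / n_covariates


# Toy example with made-up partitions of covariates 0..5.
true_partition = [[0, 1], [2, 3, 4], [5]]
estimated_partition = [[0, 1], [2, 3], [4, 5]]
print(proportion_recovered(estimated_partition, true_partition))
```

In the toy example only the block {0, 1} is recovered exactly, so two of the six covariates count as correctly assigned. One such value per simulated data set would give the sample from which a boxplot is drawn.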
4.2 MovieLens data set
The MovieLens dataset (Harper and Konstan, 2015) contains ratings from 137753
users on 27278 movies (excluding movies with no rating values). Ratings are on a
1-5 scale and each user has rated at most 367 movies. We restrict our study to
the 1000 most often rated movies. For each movie and each user we study
the variable equal to 1 if the user rated the movie and 0 otherwise. Our selection
method applied to the dataset selected a partition made of 322 groups of
movies whose sizes vary from 2 to 16. Figure 3 represents the distribution of
the number of variables per group in the partition. Figure 4 represents the
movies that belong to the biggest group. We notice that most of them are
Action/Sci-Fi/Adventure movies released in the early 2000s. Similarly, most of the
groups are made of movies with similar genres and years. Other examples of
groups forming the selected partition are provided in Figure 5.
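The preprocessing described above (keeping the most often rated movies and coding each user-movie pair as rated or not) and the group-size summary of Figure 3 can be sketched as follows. This is an illustrative sketch only: the function names, the input format (a list of `(user_id, movie_id)` pairs, one per rating) and the toy data are assumptions, and the selection procedure itself is not reproduced here.

```python
from collections import Counter


def rated_indicator_matrix(ratings, n_top):
    """Build the 0/1 'user rated movie' matrix for the n_top most rated movies.

    ratings: iterable of (user_id, movie_id) pairs, one per rating
             (hypothetical input format).
    Returns (users, top_movies, matrix), where matrix[i][j] is 1 if
    users[i] rated top_movies[j] and 0 otherwise.
    """
    pairs = set(ratings)  # ignore duplicated ratings
    counts = Counter(movie for _, movie in pairs)
    # Sort by decreasing number of ratings, breaking ties by movie id.
    top_movies = sorted(counts, key=lambda m: (-counts[m], m))[:n_top]
    users = sorted({user for user, _ in pairs})
    matrix = [
        [1 if (user, movie) in pairs else 0 for movie in top_movies]
        for user in users
    ]
    return users, top_movies, matrix


def group_size_distribution(partition):
    """Map each group size to the number of groups having that size."""
    return dict(sorted(Counter(len(group) for group in partition).items()))


# Toy data (made up): three users, three movies, keep the two most rated.
toy_ratings = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"),
    ("u3", "A"), ("u3", "B"), ("u3", "C"),
]
users, top_movies, matrix = rated_indicator_matrix(toy_ratings, n_top=2)
print(top_movies)
print(matrix)

# Toy partition of movies into groups (made up).
print(group_size_distribution([["A", "B"], ["C", "D"], ["E", "F", "G"]]))
```

For the study described here, `n_top` would be 1000, and `group_size_distribution` applied to the selected partition would give the counts plotted in Figure 3.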
5. Discussion and conclusion
Figures 4 and 5 illustrate the quality of the variable clustering provided by the
method. We also provide a consistent estimator of the target distribution.
This estimator may be used to understand the joint behavior of the variables
belonging to the same block. Conditioning this estimator can also allow prediction on new
partially observed datasets.
222 | ISI WSC 2019