Page 227 - Contributed Paper Session (CPS) - Volume 4
CPS2203 Thierry D. et al.
Model Selection for Covariates Clustering
Thierry Dumont, Ana Karina Fermin
Université Paris Nanterre
Abstract
In this talk we study the problem of inference in a high-dimensional setting
where each variable is assumed to be Boolean. We present a method to
select an appropriate partition of the variables, such that variables that are
not grouped together are assumed to be independent.
Keywords
Clustering; high-dimensionality; covariate partition; thresholding; model
selection; slope heuristic; binary data; MovieLens dataset.
1. Introduction
In many fields of application, the data scientist has to handle many
covariates that may depend on one another. Examples include social
networks, the Netflix problem and metagenomics surveys.
We consider here the problem of estimating the joint law of $p$ binary
covariates. Even in this simple case, $2^p - 1$ parameters are needed to
describe their distribution, so inference quickly becomes intractable
as $p$ increases.
In this paper we focus on this binary case and propose a method to
overcome this curse of dimensionality. We propose to project the
data onto low-dimensional models. The covariates are clustered into $K$ blocks
such that covariates belonging to different blocks are assumed to be
independent. Let $p_1, \dots, p_K$ be the numbers of covariates in the blocks;
such a clustering involves a number of parameters equal to
$\sum_{k=1}^{K} (2^{p_k} - 1)$.
As an illustration, if each block contains one variable only, then $K = p$, the
$p_k$ are all equal to 1 and the number of parameters associated with this
clustering equals $p$. This is the simplest model. At the opposite extreme, if
all the covariates belong to the same block, then $K = 1$ and $p_1 = p$. This
corresponds to the most expensive model, with a number of parameters equal to
$2^p - 1$.
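The parameter counts of the two extreme models, and of anything in between, are easy to check numerically. The following is a minimal sketch, assuming a partition is represented only by its list of block sizes; the function name `n_parameters` and the intermediate four-block case are illustrative choices, not part of the paper.

```python
def n_parameters(block_sizes):
    """Number of free parameters of a block-independent model on binary covariates.

    A block of p_k binary covariates has 2**p_k possible configurations,
    hence 2**p_k - 1 free probabilities.
    """
    if any(size < 1 for size in block_sizes):
        raise ValueError("every block must contain at least one covariate")
    return sum(2 ** size - 1 for size in block_sizes)


p = 20
singletons = n_parameters([1] * p)    # K = p blocks of size 1
one_block = n_parameters([p])         # K = 1 block of size p
four_blocks = n_parameters([5] * 4)   # hypothetical intermediate partition

print(singletons, one_block, four_blocks)  # 20 1048575 124
```

With p = 20, the fully independent model has 20 parameters, the single-block model has 1,048,575, and four blocks of five covariates need only 124.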
To any such partition into blocks, we can associate a maximum likelihood
estimate of the density s* of the observations. We aim at finding a good
covariate partition.
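For a fixed partition, independence between blocks means the likelihood factorises over blocks, so the maximum likelihood estimate is the product of the empirical distributions of the block configurations. The following is a minimal sketch of that computation, assuming observations are given as tuples of 0/1 values and a partition as lists of column indices; `block_mle` and `log_likelihood` are illustrative names, and the toy data are made up for the example.

```python
from collections import Counter
import math


def block_mle(data, partition):
    """Empirical distribution of each block's configurations.

    data: list of observations, each a tuple of 0/1 values of length p.
    partition: list of blocks, each a list of covariate indices.
    Returns one dict per block mapping a configuration to its frequency.
    """
    n = len(data)
    estimates = []
    for block in partition:
        counts = Counter(tuple(row[j] for j in block) for row in data)
        estimates.append({config: c / n for config, c in counts.items()})
    return estimates


def log_likelihood(data, partition, estimates):
    """Log-likelihood of the data under the product of block distributions."""
    total = 0.0
    for row in data:
        for block, probs in zip(partition, estimates):
            total += math.log(probs[tuple(row[j] for j in block)])
    return total


# Toy data (hypothetical): covariates 0 and 1 always agree, covariate 2 varies.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
grouped = [[0, 1], [2]]       # 3 + 1 = 4 parameters
separate = [[0], [1], [2]]    # 1 + 1 + 1 = 3 parameters

ll_grouped = log_likelihood(data, grouped, block_mle(data, grouped))
ll_separate = log_likelihood(data, separate, block_mle(data, separate))
print(ll_grouped > ll_separate)  # True
```

On this toy sample, grouping the two dependent covariates gives a higher likelihood at the cost of one extra parameter, which is the kind of trade-off a partition selection procedure has to settle.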
We propose a statistical view of this partition selection problem. Since the
number of possible partitions grows exponentially with the number of
covariates, we use a procedure based on the empirical correlation matrix to
216 | ISI WSC 2019