Page 229 - Contributed Paper Session (CPS) - Volume 4
CPS2203 Thierry D. et al.
basis of this sample, we are interested in estimating the distribution π⋆ from
the data when the dimension is large.
For this purpose, on the basis of the observations we consider, for each
m ∈ M, an estimator π̂_m of the density that belongs to the model associated
with m. One wishes to select among the family of estimators (π̂_m)_{m ∈ M} the
one that realizes the "best" compromise between data adjustment and the
corresponding model's complexity. However, such a selection requires an
exhaustive exploration of all possible partitions, which is intractable.
We propose in Section 3.1 a method to considerably restrict the set of all
partitions to a set M̂ that naturally arises from thresholding the empirical
correlation matrix. Section 3.2 provides a criterion that allows us to choose
among the candidate densities (π̂_m)_{m ∈ M̂}.
3.1 Data-driven partition set
Let R = (r(a, b))_{1 ≤ a, b ≤ p} be the empirical p × p correlation matrix
associated with the data y = (y1, y2, . . . , yn) with yi ∈ {0,1}^p. For any a
and b in {1, . . . , p}, r(a, b) = corr(y^(a), y^(b)), the empirical correlation
between the a-th and b-th coordinates.
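As a minimal illustration (the variable names and the use of NumPy are assumptions, not part of the paper), the empirical correlation matrix of binary data can be computed directly from the stacked observations:

```python
import numpy as np

# Minimal sketch (names are assumptions): n observations y_1, ..., y_n
# of a p-dimensional binary vector, stacked as rows of an n x p array.
rng = np.random.default_rng(0)
n, p = 200, 5
y = rng.integers(0, 2, size=(n, p))    # each y_i lies in {0,1}^p

# Empirical p x p correlation matrix R = (r(a, b)).
R = np.corrcoef(y, rowvar=False)
```

Since the coordinates are binary, r(a, b) is simply the empirical Pearson correlation of the 0/1 columns a and b.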
Let λ ∈ [0,1) be a threshold, and define the partition m_λ associated with λ
via the adjacency matrix A_λ = (1_{|r(a,b)| > λ})_{(a,b) ∈ {1,...,p}²}.
Let G_λ = (V, E_λ) be the graph associated with A_λ, where the vertices (nodes)
V represent the p covariates and E_λ is the collection of non-oriented edges.
For two nodes a and b, {a, b} ∈ E_λ if A_λ(a, b) = 1, that is, if
|r(a, b)| > λ. The partition m_λ corresponds to the connected components of
this graph. Remark that if λ lies between two consecutive values of the
|r(a, b)|'s, say c1 ≤ λ < c2, then the partition m_λ is the same as m_{c1}.
Therefore, if Λ = {|r(a, b)| : |r(a, b)| < 1} ∪ {0}, we consider
M̂ = {m_λ}_{λ ∈ Λ}, the resulting collection of partitions. It is
straightforward to show that |M̂| ≤ p.
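The construction of Section 3.1 can be sketched as follows; this is a minimal sketch, where the function names and the depth-first search are implementation choices, not the paper's:

```python
import numpy as np

def partition_at(R, lam):
    """Connected components of the graph with edge {a, b} iff |r(a, b)| > lam."""
    p = R.shape[0]
    adj = np.abs(R) > lam
    labels = [-1] * p                      # component label of each node
    comp = 0
    for start in range(p):
        if labels[start] >= 0:
            continue
        stack = [start]                    # depth-first search from 'start'
        labels[start] = comp
        while stack:
            a = stack.pop()
            for b in range(p):
                if a != b and adj[a, b] and labels[b] < 0:
                    labels[b] = comp
                    stack.append(b)
        comp += 1
    # A partition is represented as a frozenset of blocks (frozensets of nodes).
    return frozenset(
        frozenset(i for i in range(p) if labels[i] == c) for c in range(comp)
    )

def partition_collection(R):
    """M_hat: the partitions m_lambda for lambda in Lambda = {|r(a,b)| < 1} U {0}."""
    p = R.shape[0]
    lams = {0.0} | {abs(R[a, b]) for a in range(p) for b in range(p)
                    if abs(R[a, b]) < 1}
    return {partition_at(R, lam) for lam in lams}

# Toy correlation matrix: covariates 0 and 1 are strongly correlated.
R = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
M_hat = partition_collection(R)
assert len(M_hat) <= R.shape[0]            # |M_hat| <= p, as stated above
```

Distinct thresholds lying between two consecutive values of the |r(a, b)|'s yield the same partition, which is why it suffices to scan the finite grid Λ.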
Model Selection for binary Variable Clustering
3.2 Data-driven partition selection
Let M̂, a subset of M, be a random collection of partitions, measurable with
respect to (y1, . . . , yn). We wish to select among M̂ a model that allows
theoretical bounds on the distance between the true density π⋆ and the
selected estimator π̂_m̂. We propose a penalized version of the maximum
likelihood estimator, defined as follows.
Next, let m̂ be defined as

m̂ = argmin_{m ∈ M̂} { −(1/n) ∑_{i=1}^{n} log(π̂_m(yi)) + pen(m) }    (3.1)
where, if m = {B1,B2,...,BK},
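A schematic implementation of the selection rule (3.1) could look as follows; this is a sketch under assumptions, with the fitted log-densities and the penalty passed in as abstract functions, since their exact forms are not fixed by this excerpt:

```python
def select_partition(candidates, data, log_density, pen):
    """Return the candidate m minimizing -(1/n) * sum_i log pi_hat_m(y_i) + pen(m).

    candidates  : iterable of models m (the collection M_hat)
    data        : the observations y_1, ..., y_n
    log_density : (m, y) -> log pi_hat_m(y), assumed fitted beforehand
    pen         : m -> penalty term pen(m)
    """
    data = list(data)
    n = len(data)

    def criterion(m):
        return -sum(log_density(m, y) for y in data) / n + pen(m)

    return min(candidates, key=criterion)

# Toy illustration with two dummy "models" and constant log-densities.
candidates = ["coarse", "fine"]
data = [0, 1, 1, 0]
log_density = lambda m, y: -1.0 if m == "coarse" else -0.5
pen = lambda m: 0.0 if m == "coarse" else 0.8
m_hat = select_partition(candidates, data, log_density, pen)
# criterion("coarse") = 1.0 and criterion("fine") = 1.3, so m_hat == "coarse"
```

The penalty trades off fit against model complexity: with pen ≡ 0 the rule reduces to plain maximum likelihood, which here would pick the better-fitting "fine" model instead.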
218 | ISI WSC 2019