Page 88 - Contributed Paper Session (CPS) - Volume 2
CPS1440 Avner Bar-Hen et al.
classes are equally frequent. The two most popular heterogeneity criteria are
the Shannon entropy and the Gini index. The growing procedure stops
when no admissible split remains. Each leaf is assigned to the most
frequent class of its observations.
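
As an illustration of the two heterogeneity criteria and of the majority-vote labelling of a leaf, the following is a minimal sketch, not the authors' implementation. The function names are hypothetical, a node is represented simply as a list of class labels, and the base-2 logarithm in the entropy is an assumption (the text does not fix the base).

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the relative frequency of each class present in a node."""
    if not labels:
        raise ValueError("a node must contain at least one observation")
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini_index(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def shannon_entropy(labels):
    """Shannon entropy of the class proportions (base 2 is an assumption)."""
    return -sum(p * log2(p) for p in class_proportions(labels))


def majority_class(labels):
    """Label assigned to a leaf: the most frequent class of its observations.

    Ties are broken in favour of the class encountered first.
    """
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy nodes: one pure, one perfectly mixed.
pure_node = ["a", "a", "a"]
mixed_node = ["a", "b"]
print(gini_index(pure_node), gini_index(mixed_node))
print(shannon_entropy(pure_node), shannon_entropy(mixed_node))
print(majority_class(["a", "b", "b"]))
```

Both criteria vanish on a pure node and are largest when the classes are equally frequent, which is why either can drive the choice of a split.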
Of course, such a maximal tree (denoted by $T_{\max}$) generally overfits the
training data and the associated prediction error $R(T_{\max})$, with
$$R(T) = \mathbb{P}\big(T(X^1, \ldots, X^p) \neq Y\big) \qquad (1)$$
is typically large. Since the goal is to build from the available data a tree $T$
whose prediction error is as small as possible, in a second stage the tree $T_{\max}$
is pruned to produce a subtree $T'$ whose expected performance is close to the
minimum of $R(T')$ over all binary subtrees $T'$ of $T_{\max}$. Since the joint distribution
$P$ of $(X^1, \ldots, X^p, Y)$ is unknown, the pruning is based on the penalized empirical
risk $\hat{R}_{\mathrm{pen}}(T)$, which offsets the optimism of the empirical risk by adding a
complexity term that penalizes larger subtrees. More precisely, the empirical
risk is penalized by a complexity term that is linear in the number of leaves
of the tree:
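$$\hat{R}_{\mathrm{pen}}(T) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{T(X_i^1, \ldots, X_i^p) \neq Y_i\}} + \alpha |T| \qquad (2)$$
where $\mathbf{1}$ is the indicator function, $n$ the total number of observations, $\alpha$ a
positive penalty constant, $|T|$ denotes the number of leaves of the tree $T$, and
$(X_i^1, \ldots, X_i^p, Y_i)$ is the $i$th random realization of $(X^1, \ldots, X^p, Y)$.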
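
The penalized criterion (2) is simple to evaluate once a tree's predictions on the sample are available. The sketch below is illustrative only: the function name, the representation of a tree by its list of predictions and its number of leaves, and the toy numbers are all assumptions, not part of the paper.

```python
def penalized_empirical_risk(predictions, labels, n_leaves, alpha):
    """Misclassification rate on the sample plus alpha times the leaf count."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    n = len(labels)
    # Sum of the indicator that the tree's prediction differs from the label.
    misclassified = sum(1 for p, y in zip(predictions, labels) if p != y)
    return misclassified / n + alpha * n_leaves


# Hypothetical comparison of two candidate subtrees on four observations.
labels = [0, 1, 0, 0]
large_tree = penalized_empirical_risk([0, 1, 0, 0], labels, n_leaves=8, alpha=0.05)
small_tree = penalized_empirical_risk([0, 1, 1, 0], labels, n_leaves=2, alpha=0.05)
print(large_tree, small_tree)
```

In this toy case the larger tree fits the sample perfectly but pays for its eight leaves, so the smaller subtree has the lower penalized risk. A larger value of alpha favours smaller subtrees.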
2.2 Intertype K-function
A point process is a random variable that gives the locations of events
in a set $W \subset \mathbb{R}^d$. Another way to define a given point process is to consider,
for each $B \subset W$, the number of events $\phi(B)$ occurring in $B$, where $\phi$ is the
distribution of the occurrences of the point process.
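
To make the counting view concrete, here is a minimal sketch for the planar case (d = 2). It assumes, purely for illustration, that one realization is stored as a list of (x, y) tuples and that the set B is a half-open rectangle; the function name and the data are hypothetical.

```python
def count_in_box(events, box):
    """Number of events falling in the rectangle [xmin, xmax) x [ymin, ymax)."""
    (xmin, xmax), (ymin, ymax) = box
    return sum(1 for x, y in events if xmin <= x < xmax and ymin <= y < ymax)


# Hypothetical realization in the unit square W = [0, 1) x [0, 1).
events = [(0.1, 0.2), (0.5, 0.5), (0.9, 0.8), (0.3, 0.7)]
left_half = ((0.0, 0.5), (0.0, 1.0))
right_half = ((0.5, 1.0), (0.0, 1.0))
whole_window = ((0.0, 1.0), (0.0, 1.0))
print(count_in_box(events, left_half), count_in_box(events, right_half))
```

Half-open rectangles are used so that counts over disjoint pieces of the window add up to the count over their union.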
Since the characterization of a spatial distribution depends strongly on the
scale of observation, the distribution has to be characterized at each scale.
A marked point process is a point process such that a random mark is
associated with each location. Here, we only consider bivariate point
processes, i.e. the mark is a qualitative random variable with two possible
outcomes. Equivalently, the bivariate point process can be viewed as the
realization of two point processes (one per level of the mark).
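
The equivalence between a bivariate marked pattern and two unmarked patterns can be sketched as follows. The (x, y, mark) tuple layout, the function name, and the mark labels are assumptions made for illustration only.

```python
def split_by_mark(marked_events):
    """Group the locations of a marked pattern by the level of their mark."""
    by_mark = {}
    for x, y, mark in marked_events:
        by_mark.setdefault(mark, []).append((x, y))
    return by_mark


# Hypothetical bivariate pattern with mark levels "A" and "B".
pattern = [(0.1, 0.2, "A"), (0.5, 0.5, "B"), (0.9, 0.8, "A")]
clouds = split_by_mark(pattern)
print(clouds["A"])
print(clouds["B"])
```

Each value of the returned dictionary is one cloud of points, so the two clouds can then be compared with any index of their relationship.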
There are several ways to consider the relationships between two clouds
of points, mainly related to three aspects: independence, association and
random labelling (see [1] for example). It follows that relationships between
two clouds of points can be described in various ways and therefore many
indices can be defined. Each index gives specific information about these
relationships and depends greatly on the point process that leads to the
77 | ISI WSC 2019