Page 89 - Invited Paper Session (IPS) - Volume 1
IPS98 Luciana D. V. et al.
This approach involves the selection of a calibration link variable and the
creation of a new artificial data set by suitably resampling the observations
belonging to the classes of the calibration link. In particular, the calibration
link variable is resampled by oversampling with replacement the minority class
and by undersampling without replacement the majority class. More formally,
the resampling approach can be described as follows. Consider pairs (x, y),
where x represents a set of measured characteristics and y is a target (or key)
variable. Here, we consider the specific case where x takes values in a
d-dimensional space X, defined as the product of discrete domains, and the
target variable y, which is affected by class imbalance, takes values in the
categorical domain Y = {Ymin, Ymaj}, where Ymin is the minority class and
Ymaj is the majority class. Suppose that a sample Dn = {(x1, y1), …, (xn, yn)}
of the pairs (x, y), whose generic row is (xi, yi), i = 1, …, n, is observed on
n individuals or objects. The class labels yi belong to the set {Ymin, Ymaj},
and the xi are related attributes assumed to be realizations of a random
vector x. Let the number of units in class Yj, j = min, maj, be denoted by
nj < n, and the corresponding class proportions by pj = nj/n.
The resampling procedure for generating a new artificially re-balanced
dataset consists of the following steps:
1) Select y* = Yj, j = min, maj, with probability 1/2.
2) Select (xi, yi) ∈ Dn, such that yi = y*, with probability 1/nj.
   a. If y* = Ymin, oversample with replacement by adding (xi, y*) to Dn;
   b. If y* = Ymaj, undersample without replacement by removing (xi, y*)
      from Dn.
Repeat steps 1 and 2 until the desired class proportions are achieved or
until the minority class reaches the desired size. This procedure produces a
new rebalanced dataset Dm*, of size m, in which the two classes appear in
the desired proportions. For more details about the class imbalance problem
and resampling techniques see, for example, Chawla (2005) and Menardi and
Torelli (2014).
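The two-step procedure above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `rebalance`, its parameters, and the stopping rule (stop once the minority proportion reaches `target_min_prop`) are assumptions made for illustration. The sketch also assumes that minority observations are always redrawn from the original minority rows, and that at least one majority observation is retained.

```python
import random


def rebalance(data, y_min, y_maj, target_min_prop=0.5, seed=None):
    """Return a rebalanced copy of `data`, a list of (x, y) pairs.

    Hypothetical helper: names and stopping rule are illustrative assumptions.
    """
    if not 0 < target_min_prop < 1:
        raise ValueError("target_min_prop must lie strictly between 0 and 1")

    rng = random.Random(seed)
    # Original minority rows: the pool used for oversampling with replacement
    # (assumption: copies are drawn from the original rows only).
    minority_pool = [row for row in data if row[1] == y_min]
    minority = list(minority_pool)
    majority = [row for row in data if row[1] == y_maj]
    if not minority_pool or not majority:
        raise ValueError("both classes must be present in the data")

    def minority_proportion():
        return len(minority) / (len(minority) + len(majority))

    while minority_proportion() < target_min_prop:
        # Step 1: pick a class label with probability 1/2.
        y_star = rng.choice([y_min, y_maj])
        if y_star == y_min:
            # Step 2a: oversample with replacement by adding a minority row.
            minority.append(rng.choice(minority_pool))
        elif len(majority) > 1:
            # Step 2b: undersample without replacement by removing a
            # uniformly chosen majority row (keep at least one, by assumption).
            majority.pop(rng.randrange(len(majority)))

    return minority + majority
```

Calling `rebalance(data, "min", "maj", target_min_prop=0.5, seed=0)` on a list of labelled pairs returns a new list whose size m depends on the random draws; the input list is left unchanged.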
In the present work, the resampling approach described above is applied
to interview- and online-based imbalanced datasets to achieve data
integration. Following this bias correction, BNs are built to identify the main
determinants of customer satisfaction.
The proposed data integration methodology is structured in three phases:
1) Data structure modelling. Let D^SU denote the interview-based survey
dataset and D^SM denote the social media dataset. This phase consists in
implementing BNs to construct the causal relationships between the
variables of both the customer survey, D^SU, and social media, D^SM, datasets,
separately. BNs are chosen amongst other data modelling techniques for
their flexibility and ability to encode probabilistic relationships among
variables of interest, allowing an easy identification of the determinants of
78 | I S I W S C 2 0 1 9