Page 89 - Invited Paper Session (IPS) - Volume 1

IPS98 Luciana D. V. et al.
                This approach involves the selection of a calibration link variable and the
            creation of a new artificial data set by suitably resampling the observations
            belonging to the classes of the calibration link. In particular, the calibration
            link variable is resampled by oversampling with replacement the minority class
            and by undersampling without replacement the majority class.  More formally,
            the  resampling  approach  can  be  described  as  follows.  Let  us  consider  the
            variables denoted by the pairs (x, y), where x represents a set of measured
            characteristics and y is a target (or key) variable. Here, we consider the specific
            case where x is defined in a d-dimensional space X, given by the product set
            of discrete domains, and the target variable y, which is affected by class
            imbalance, takes values in the categorical domain Y= {Ymin, Ymaj}, where Ymin is
            the minority class and Ymaj is the majority class.  Suppose that a sample Dn =
            (x1, y1), …, (xn, yn), of the pairs (x, y), whose generic row is (xi, yi), i = 1, …, n, is
            observed on n individuals or objects. The class labels yi belong to the set {Ymin,
            Ymaj}  and  xi  are  some  related  attributes  supposed  to  be  realizations  of  a
            random vector x. Let the number of units in class Yj, j = min, maj, be denoted
            by nj < n and the corresponding class proportions be denoted by pj = nj /n.
            The resampling procedure for generating a new, artificially rebalanced
            dataset consists of the following steps:
            1)  Select y* = Yj with probability 1/2.
            2)  Select (xi, yi) ∈ Dn, such that yi = y*, with probability 1/nj.
               a.  If y* = Ymin, oversample with replacement by adding (xi, y*) to Dn;
               b.  If y* = Ymaj, undersample without replacement by removing (xi, y*) from
                  Dn.
                Repeat steps 1 and 2 until the desired class proportions are achieved or
            until the minority class reaches the desired size. This procedure produces a
            new rebalanced dataset Dm*, of size m, in which the observations are
            distributed between the two classes in the desired proportions. For more
            details about the class
            imbalance  problem  and  resampling  techniques  see,  for  example,  Chawla
            (2005) and Menardi and Torelli (2014).
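
The two-step procedure above can be sketched in Python as follows. This is only an illustrative reading of the steps, not the authors' implementation. The function name `rebalance`, the parameter `target_min_share` and the toy data are hypothetical. Two assumptions are made where the text leaves room for interpretation: minority units are redrawn from the original minority observations (probability 1/nmin each), and the loop stops as soon as the minority share reaches the requested value.

```python
import random


def rebalance(data, minority, majority, target_min_share=0.5, seed=None):
    """Return a rebalanced copy of `data`, a list of (x, y) pairs.

    Assumption: the stopping rule is 'minority share >= target_min_share'.
    """
    if not 0 < target_min_share < 1:
        raise ValueError("target_min_share must lie strictly between 0 and 1")
    rng = random.Random(seed)
    # Original minority units; oversampling draws from this fixed pool.
    min_pool = [row for row in data if row[1] == minority]
    # Majority units still in the dataset; shrinks as units are removed.
    maj_rows = [row for row in data if row[1] == majority]
    if not min_pool or not maj_rows:
        raise ValueError("both classes must be present in the data")
    added = []  # minority copies created by oversampling

    def min_share():
        n_min = len(min_pool) + len(added)
        return n_min / (n_min + len(maj_rows))

    while min_share() < target_min_share:
        # Step 1: pick one of the two classes with probability 1/2.
        if rng.random() < 0.5:
            # Step 2a: oversample the minority class with replacement.
            added.append(rng.choice(min_pool))
        elif len(maj_rows) > 1:
            # Step 2b: undersample the majority class without replacement
            # (assumption: at least one majority unit is always kept).
            maj_rows.pop(rng.randrange(len(maj_rows)))
    return min_pool + added + maj_rows


# Toy example: 10 minority and 90 majority observations.
toy = [((i,), "min") for i in range(10)] + [((i,), "maj") for i in range(10, 100)]
balanced = rebalance(toy, "min", "maj", target_min_share=0.5, seed=1)
print(len(balanced), sum(1 for _, y in balanced if y == "min"))
```

Because every iteration either adds one minority unit or removes one majority unit, the size m of the rebalanced dataset is not fixed in advance in this sketch; it depends on how often each class is selected in step 1.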
                In the present work, the resampling approach described above is applied
            to  interview-  and  online-based  imbalanced  datasets  to  achieve  data
            integration.  Following this bias correction, BNs are built to identify the main
            determinants of customer satisfaction.
                The proposed data integration methodology is structured in three phases:
            1)  Data structure modelling. Let D^SU denote the interview-based survey
               dataset and D^SM denote the social media dataset. This phase consists in
               implementing BNs to construct the causal relationships between the
               variables of both the customer survey, D^SU, and social media, D^SM, datasets,
               separately. BNs are chosen amongst other data modelling techniques for
               their  flexibility  and  ability  to  encode  probabilistic  relationships  among
               variables of interest, allowing an easy identification of the determinants of

                                                               78 | I S I   W S C   2 0 1 9