
                    Table 1: Common characteristics of Big biomedical and healthcare datasets.

                    Dimensions of Big Data     Properties and Tool specifications
                    Size                       Harvesting and management of vast amounts of data
                    Complexity                 Wranglers for dealing with heterogeneous data
                    Incongruency               Tools for data harmonization and aggregation
                    Multi-source               Transfer and joint modeling of disparate elements
                    Multi-scale                Macro to meso to micro scale observations
                    Time                       Techniques accounting for longitudinal patterns in the data
                    Incomplete                 Reliable management of missing data

                  2. Methodology
                      There are a few complementary strategies that enable scientific computing
                  on sensitive datasets. Examples of these include ℇ-differential privacy (Dwork
                  2009), homomorphic encryption (Gentry 2009), and statistical obfuscation via
                  DataSifter  (Marino,  Zhou  et  al.  2018).  Below  we  review  each  of  these
                  techniques.


                  2.1 ℇ-differential privacy (ℇ-DP)
                      ℇ-DP provides a mechanism to mine information in databases without
                  compromising privacy. By estimating the theoretical limits on the balance
                  between  information  utility  and  risk  of  sharing  data,  this  technique
                  enables data governors to quantify the potential risks of information re-
                  identification.  However,  it  is  difficult  to  apply  on  high-dimensional,
                  unstructured, skewed, or categorical data (Dwork 2009).
                       Assume we have a dataset including measurements of the following
                   features: { C1, C2, ... , Ck }, which can be categorical or numerical.
                   Relational databases (DBs) store lists of cases { x1, x2, ... , xn }, with
                   xi ∈ C1 × C2 × ... × Ck, 1 ≤ i ≤ n. ℇ-differential privacy relies on adding
                   noise to the data in the database, which protects against reidentification
                   of individual records. An algorithm f is called ℇ-differentially private
                   if, for all possible inputs (datasets or DBs) D1, D2 that differ on a
                   single record, and all possible f outputs y, the probability of correctly
                   guessing D1 knowing y is not significantly different from the corresponding
                   probability of D2 given y. In other words,

                            P( f(D1) = y )  ≤  e^ℇ · P( f(D2) = y ),   for all outputs y.

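                       To make the noise-addition idea concrete, the following minimal
                   Python sketch (not part of the original paper) illustrates the Laplace
                   mechanism, one standard way to satisfy ℇ-DP for a numeric query; the
                   dataset, query, and parameter values below are hypothetical examples.

                   import numpy as np

                   def laplace_mechanism(query_result, sensitivity, epsilon, rng=None):
                       """Perturb a numeric query result with Laplace(0, sensitivity/epsilon) noise."""
                       rng = np.random.default_rng() if rng is None else rng
                       scale = sensitivity / epsilon   # larger sensitivity or smaller ℇ -> more noise
                       return query_result + rng.laplace(loc=0.0, scale=scale)

                   # Toy counting query: adding or removing one record changes the count
                   # by at most 1, so the global sensitivity of this query is 1.
                   ages = np.array([34, 51, 29, 62, 47, 38])     # hypothetical records
                   true_count = int(np.sum(ages > 40))           # exact answer: 3
                   private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
                   print(true_count, private_count)              # the released value is noisy

                   With sensitivity 1 and ℇ = 0.5, the noise scale is 2, so a released count
                   typically deviates from the true count by about two units in this sketch.
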
                       Clearly, the small positive number ℇ > 0, for which e^ℇ ≈ 1, controls
                   the level of uncertainty about reidentification of the source data (D1 or
                   D2) from the known observation y; for instance, ℇ = 0.1 gives e^ℇ ≈ 1.105,
                   so the two likelihoods can differ by at most about 10%. The global
                   sensitivity of f is the smallest number

