Page 34 - Invited Paper Session (IPS) - Volume 1
IPS35 Dinov I.D. et al.
Table 1: Common characteristics of Big biomedical and healthcare datasets.
Dimensions of Big Data   Properties and tool specifications
----------------------   -----------------------------------------------------------
Size                     Harvesting and management of vast amounts of data
Complexity               Wranglers for dealing with heterogeneous data
Incongruency             Tools for data harmonization and aggregation
Multi-source             Transfer and joint modeling of disparate elements
Multi-scale              Macro to meso to micro scale observations
Time                     Techniques accounting for longitudinal patterns in the data
Incomplete               Reliable management of missing data
2. Methodology
Several complementary strategies enable scientific computing on sensitive
datasets. These include ε-differential privacy (Dwork 2009), homomorphic
encryption (Gentry 2009), and statistical obfuscation via DataSifter
(Marino, Zhou et al. 2018). Below we review each of these techniques.
2.1 ε-differential privacy (ε-DP)
ε-DP provides a mechanism to mine information in databases without
compromising privacy. By estimating the theoretical limits on the balance
between information utility and the risk of sharing data, this technique
enables data governors to quantify the potential risk of reidentification.
However, it is difficult to apply to high-dimensional, unstructured,
skewed, or categorical data (Dwork 2009).
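To make the utility-versus-risk balance concrete, the following minimal sketch releases a noisy count from a small table. It assumes the Laplace mechanism, a standard ε-DP construction that the text does not name, and it assumes that a counting query changes by at most 1 when a single record changes. The function names (laplace_noise, private_count, laplace_density) and the toy records are hypothetical and serve only as an illustration.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution with this scale."""
    # The difference of two i.i.d. exponential variates with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # Assumption: changing one record changes a count by at most 1.
    sensitivity = 1.0
    # Smaller epsilon means a larger noise scale: more privacy, less utility.
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def laplace_density(x, mean, scale):
    """Density of the Laplace(mean, scale) distribution at x."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


if __name__ == "__main__":
    rng = random.Random(2019)  # fixed seed only to make the demo repeatable
    # Hypothetical toy records, not data from the paper.
    records = [{"age": age} for age in (34, 51, 67, 72, 45)]
    for epsilon in (0.1, 1.0, 10.0):
        noisy = private_count(records, lambda r: r["age"] > 50, epsilon, rng)
        print(f"epsilon={epsilon}: noisy count = {noisy:.2f}")
```

In this sketch, two tables whose true counts differ by 1 produce output densities whose ratio, computed with laplace_density at any point y, stays between exp(-ε) and exp(ε). That bounded ratio is what limits how much an observer can learn about a single record from the released value.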
Assume we have a dataset including measurements of the following
features: $\{C_1, C_2, \ldots, C_k\}$, each of which can be categorical or
numerical. Relational databases (DBs) store lists of cases
$\{x_1, x_2, \ldots, x_n\}$, $x_i \in C_1 \times C_2 \times \cdots \times C_k$,
$1 \le i \le n$. ε-Differential privacy relies on adding noise to the data
in the database, which protects against reidentification of individual
records. An algorithm $f$ is called ε-differentially private if, for all
possible inputs (datasets or DBs) $D_1$, $D_2$ that differ on a single
record and all possible outputs $y$ of $f$, the probability of correctly
guessing $D_1$ knowing $y$ is not significantly different from the
corresponding probability of $D_2$ given $y$. In other words,
\[
  \frac{\Pr[f(D_1) = y]}{\Pr[f(D_2) = y]} \le e^{\varepsilon}.
\]
Clearly the small positive number $\varepsilon > 0$, with
$e^{\varepsilon} \approx 1$, controls the level of uncertainty about
reidentification of the source data ($D_1$ or $D_2$) from the known
observation, $y$. The global sensitivity of $f$ is the smallest number
23 | ISI WSC 2019