Page 223 - Special Topic Session (STS) - Volume 4
STS582 Júlia M. P. S.
biologically relevant molecular signatures and their analysis can suggest novel
biological hypotheses.
Another class of N-integration techniques is based on a flexible regression
framework. Under an unsupervised approach, probabilistic graphical models
(PGMs) can be used for learning relations among multiple variables
(Meinshausen et al., 2006). Tenenhaus et al. (2014) proposed a generalized
canonical correlation analysis for N-integration with loads in the optimization
problem defined in terms of the connections in a PGM. In addition, supervised
N-integration can be performed by incorporating varying coefficients (Hastie
and Tibshirani, 1993) into the regression model, with the multi-omics
integration oriented toward the prediction of clinical outcomes. In this context,
Ni et al. (2018) proposed a Bayesian hierarchical varying-sparsity regression
model and applied it to the integration of genomic and proteomic data in order
to predict patient survival time.
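To make the idea of a varying coefficient concrete, the following is a minimal sketch, not the Bayesian varying-sparsity model of Ni et al. (2018). It assumes the simplest possible case, in which the effect of a feature x on the outcome y is a linear function of a modifier u, so that y = a + (b0 + b1*u)*x + error, and fits it by ordinary least squares. The names x, u, y, the function `fit_varying_coefficient` and the simulated values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_varying_coefficient(x, u, y):
    """Least-squares fit of y = a + (b0 + b1*u) * x, a varying-coefficient
    model whose slope for x is a linear function of the modifier u."""
    # Columns: intercept, x (gives b0), x*u (gives b1).
    design = np.column_stack([np.ones_like(x), x, x * u])
    a, b0, b1 = np.linalg.lstsq(design, y, rcond=None)[0]
    return a, b0, b1

# Simulated data (assumption: illustrative values, not from any real study).
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)          # hypothetical omics feature
u = rng.uniform(0, 1, size=n)   # hypothetical effect modifier
y = (0.5 + 2.0 * u) * x + rng.normal(scale=0.5, size=n)

a, b0, b1 = fit_varying_coefficient(x, u, y)
print(f"intercept={a:.2f}, b0={b0:.2f}, b1={b1:.2f}")
```

In this sketch the slope of x ranges from about 0.5 (when u = 0) to about 2.5 (when u = 1). A richer basis in u, or a sparsity-inducing prior as in the Bayesian approach cited above, would replace the linear form assumed here.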
Further, the P-integration of independent data sets measured on a
common set of variables (omics data) can be a useful opportunity to increase
sample size and gain statistical power. The main challenge in this case is to
protect the analysis from systematic heterogeneity arising from different
sources of variation, such as differing protocols. For instance,
batch and multi-center effects are sources of unwanted variation that often act as
strong confounders in the P-integration analysis. Such effects may lead to
spurious conclusions if they are not accounted for in the statistical model.
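The confounding role of a batch effect can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: two hypothetical studies measure the same variable, the studies differ by a constant shift, the proportion of cases differs between studies, and the true group effect is zero. The function `pooled_effects` and all numerical settings are invented for this example, and a study indicator added as a fixed effect stands in for whatever adjustment a real P-integration method would use.

```python
import numpy as np

def pooled_effects(seed=0, n_per_study=500, true_effect=0.0, batch_shift=2.0):
    """Simulate two studies measuring the same variable and compare a naive
    pooled group difference with one adjusted for study (batch) membership."""
    rng = np.random.default_rng(seed)
    # Study 0 has mostly controls, study 1 mostly cases, so study membership
    # is associated with the group label and can act as a confounder.
    study = np.repeat([0, 1], n_per_study)
    p_case = np.where(study == 0, 0.2, 0.8)
    group = rng.binomial(1, p_case)
    # Measured variable = true group effect + study-specific shift + noise.
    y = true_effect * group + batch_shift * study + rng.normal(size=study.size)

    # Naive pooled analysis: regress y on group only.
    X_naive = np.column_stack([np.ones_like(y), group])
    naive = np.linalg.lstsq(X_naive, y, rcond=None)[0][1]

    # Adjusted analysis: add the study indicator as a fixed effect.
    X_adj = np.column_stack([np.ones_like(y), group, study])
    adjusted = np.linalg.lstsq(X_adj, y, rcond=None)[0][1]
    return naive, adjusted

naive, adjusted = pooled_effects()
print(f"naive group effect:    {naive:.2f}")
print(f"adjusted group effect: {adjusted:.2f}")
```

With these settings the naive estimate is pulled well away from zero by the study shift alone, whereas the adjusted estimate stays close to the true value of zero.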
Despite the recent progress made in the area of multi-omics integration,
these methods assume independent observations (unrelated individuals). If
family structure is present and ignored in the analysis, such substructure may
induce artefactual results in data integration. For instance, in the context of
uni-omics data, specifically large pedigrees and high-dimensional
SNP-genotype data, de Andrade et al. (2015) obtained valid principal
component estimators and showed that latent variables that take the
family structure into account are more informative than those that ignore it.
Ribeiro and Soler (2018), who proposed a probabilistic graphical
model for learning relationships among multiple variables from family data,
also considered the impact of clustered observations on the analysis. The outline
of this work is as follows. First, we will review and discuss unsupervised and
supervised multi-omics data integration methods under the assumption of
unrelated samples. Subsequently, we will consider family-based designs,
incorporate dependence among related individuals, and explore how the
covariance matrix among variables is decomposed into genetic and
environmental components. Finally, we will discuss how data integration
methods can be extended to take into account family structure present in the data.
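As an illustration of the decomposition mentioned above, the sketch below assumes the common polygenic mixed-model formulation, which is an assumption of this example and not a formula stated in the text. For an n x p data matrix Y (individuals in rows, variables in columns), the covariance of vec(Y) is taken to be the Kronecker sum Sigma_g (x) 2*Phi + Sigma_e (x) I_n, where Phi is the kinship matrix, Sigma_g is the genetic covariance among variables and Sigma_e is the environmental covariance. The family, the matrices `sigma_g` and `sigma_e`, and the function `family_covariance` are hypothetical.

```python
import numpy as np

# Twice the kinship matrix (2*Phi) for one nuclear family:
# father, mother, child 1, child 2 (parents assumed unrelated, no inbreeding).
two_phi = np.array([
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5, 1.0],
])

# Hypothetical genetic and environmental covariances among p = 2 variables.
sigma_g = np.array([[0.6, 0.3], [0.3, 0.4]])
sigma_e = np.array([[0.4, -0.1], [-0.1, 0.6]])

def family_covariance(two_phi, sigma_g, sigma_e):
    """Covariance of vec(Y), stacked variable by variable, under the assumed
    model: sigma_g (x) 2*Phi + sigma_e (x) I_n."""
    n = two_phi.shape[0]
    return np.kron(sigma_g, two_phi) + np.kron(sigma_e, np.eye(n))

cov = family_covariance(two_phi, sigma_g, sigma_e)

# Entry (j*n + i, k*n + l) is the covariance between variable j in
# individual i and variable k in individual l.
n = two_phi.shape[0]
print(cov[0, 0])          # variable 0 in father with itself: genetic + environmental
print(cov[0, 1 * n + 2])  # variable 0 in father, variable 1 in child 1: genetic only
```

Under this assumed model, covariance between different relatives comes only from the genetic component scaled by their relatedness, while the environmental component contributes only within an individual. Ignoring the 2*Phi term amounts to treating family members as unrelated.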
212 | ISI WSC 2019