Page 222 - Special Topic Session (STS) - Volume 4
STS582 Júlia M. P. S.
al. 2013) project provides a powerful source of such sets of data blocks. These
cross-platform datasets share common information, but individually contain
distinctive patterns. Disentangling the common patterns from the distinctive
ones, and both from the noise component, is critically important for performing
integrative, discriminative and predictive analyses of these datasets (Smilde et
al., 2017; Shu et al., 2018). Whilst single-omics analyses, whether unsupervised
or supervised, are commonly used for dimensionality reduction and
selection of relevant features for specific analytical frameworks, the integration
of multi-omics information is required to more fully unravel the complexities
of biological systems.
The first step in omics data analysis involves detecting and
adjusting for the effects of unwanted variables, which tend to appear in
addition to the measured variable(s) of interest in most, if not all,
high-throughput technologies (Leek et al., 2010). Failing to correct for these
sources of heterogeneity in the analysis can have widespread and
detrimental effects on the study: it not only reduces power and induces
unwanted dependence across genes, but can also introduce
spurious signals. This holds even for well-designed and
randomized studies. For instance, for whole-genome SNP platforms,
Price et al. (2006) applied singular value decomposition to the called genotype
data to account for systematic sources of variation due to population
substructure. In addition, many nonparametric and parametric methods are
available for batch-effect correction and normalization of gene expression
data (Wolfinger et al., 2001; Irizarry et al., 2003; Leek and
Storey, 2007; Chen et al., 2011). Further, undesirable effects in mass
spectrometry-based proteomics data have been treated by using smooth
curves and ANOVA-Simultaneous Component Analysis (Clough et al., 2012;
Mitra et al., 2016). Although all these tools are available, there is no
consensus for database integration on whether the normalization step should be
done separately for each dataset or whether researchers need to work directly
with the raw data, performing normalization and integration in a single step,
which is both statistically and computationally challenging and a topic of
current research.
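The SVD-based adjustment for population substructure mentioned above can be illustrated with a small simulation. The following is only a minimal sketch in the spirit of that idea, not the procedure of Price et al. (2006) itself: the simulated genotypes, the choice of a single axis (k = 1), and the helper names `top_axes` and `regress_out` are all assumptions made for illustration.

```python
import numpy as np

def top_axes(X, k):
    """Top-k left singular vectors of the column-centred matrix X (samples x features)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k]

def regress_out(Y, A):
    """Centre the columns of Y and remove their least-squares fit on the columns of A."""
    Yc = Y - Y.mean(axis=0, keepdims=True)
    coef, *_ = np.linalg.lstsq(A, Yc, rcond=None)
    return Yc - A @ coef

# Hypothetical data: 100 samples from two subpopulations, 500 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
n, p = 100, 500
group = np.repeat([0, 1], n // 2)

# Allele frequencies differ slightly between the two subpopulations (assumed values).
f0 = rng.uniform(0.1, 0.9, size=p)
f1 = np.clip(f0 + rng.normal(0.0, 0.2, size=p), 0.05, 0.95)
freqs = np.where(group[:, None] == 0, f0, f1)
G = rng.binomial(2, freqs).astype(float)

# Leading axis of variation, then genotypes adjusted for it.
axes = top_axes(G, k=1)
G_adj = regress_out(G, axes)

# How strongly the leading axis tracks the (normally unobserved) subpopulation label.
corr = abs(np.corrcoef(axes[:, 0], group)[0, 1])
print(f"|corr(axis 1, subpopulation)| = {corr:.2f}")
```

In this toy setting the leading axis is expected to align closely with the subpopulation label, and the adjusted matrix `G_adj` is orthogonal to that axis by construction. In practice the number of axes to remove is a modelling choice that the sketch does not address.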
Data integration can be performed through N-integration (integration of
variables), which considers different omics platforms evaluated on the same
samples, or P-integration (integration of sample units), i.e., concatenation
across studies on the same variables. Typical techniques for database
N-integration use multivariate projection-based methods, such as low-rank models,
that embed both the sample units and the features of the data blocks into the
same low-dimensional vector space (Lê Cao et al., 2009; Tenenhaus et al., 2011,
2014; Ray et al., 2017). These low-dimensional vectors enable effective data
analytics, such as clustering, visualization and missing value imputation. In
addition, these vectors are latent variables or scores possibly representing
211 | ISI WSC 2019