Page 222 - Special Topic Session (STS) - Volume 4

STS582 Júlia M. P. S.
                  al. 2013) project provides a powerful source of such sets of data blocks.
                  These cross-platform datasets share common information, but individually
                  contain distinctive patterns. Disentangling the common and distinctive
                  patterns from each other, and from the noise component, is critically
                  important for integrative, discriminative and predictive analysis of these
                  datasets (Smilde et al., 2017; Shu et al., 2018). Whilst single-omics
                  analyses, whether unsupervised or supervised, are commonly used for
                  dimensionality reduction and selection of relevant features for specific
                  analytical frameworks, the integration of multi-omics information is
                  required to more fully unravel the complexities of biological systems.
                      The first step in omics data analysis involves detecting and adjusting
                  for undesirable variable effects, which tend to appear alongside the
                  measured variable(s) of interest in most, if not all, high-throughput
                  technologies (Leek et al., 2010). Failing to correct for these sources of
                  heterogeneity in the analysis can have widespread and detrimental effects
                  on the study, not only reducing power and inducing unwanted dependence
                  across genes, but also introducing spurious signals. This holds even for
                  well-designed and randomized studies. For instance, for whole-genome SNP
                  platforms, Price et al. (2006) applied singular value decomposition to the
                  called genotype data in order to account for systematic sources of
                  variation due to population substructure. In addition, for batch-effect
                  correction and normalization in gene expression data, there are many
                  methods based on nonparametric and parametric approaches (Wolfinger et
                  al., 2001; Irizarry et al., 2003; Leek and Storey, 2007; Chen et al.,
                  2011). Further, undesirable effects in mass spectrometry-based proteomics
                  data have been treated by using smooth curves and ANOVA-Simultaneous
                  Component Analysis (Clough et al., 2012;
                  Mitra et al., 2016). Although all these tools are available, for database
                  integration there is no consensus on whether the normalization step should
                  be done through uni or whether researchers need to work directly with the
                  raw data, performing normalization and integration in a single step, which
                  is both statistically and computationally challenging and a topic of
                  current research.
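
                  As a rough illustration of the singular value decomposition adjustment
                  mentioned above, the following Python sketch projects the leading singular
                  components out of a samples-by-features matrix. It is a minimal sketch
                  under stated assumptions and not the procedure of Price et al. (2006): the
                  data are simulated, the two-group structure stands in for an unwanted
                  source of variation, and the function name remove_top_components and the
                  choice k=1 are hypothetical.

```python
import numpy as np


def remove_top_components(X, k):
    """Project the k leading singular components out of X.

    X is a samples-by-features matrix. Columns are centered first, so the
    result is a centered matrix with the dominant variation removed.
    Returns the adjusted matrix and the k leading sample scores.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rank-k approximation capturing the dominant systematic variation.
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return Xc - low_rank, U[:, :k]


# Simulated data (assumption): 40 samples, 200 features, and a strong
# two-group effect playing the role of an unwanted source of variation.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 200
group = np.repeat([0.0, 1.0], n_samples // 2)
loadings = rng.normal(size=n_features)
noise = rng.normal(scale=0.5, size=(n_samples, n_features))
X = 3.0 * np.outer(group, loadings) + noise

adjusted, scores = remove_top_components(X, k=1)

# The leading score vector should track the simulated group structure.
corr_with_group = abs(np.corrcoef(scores[:, 0], group)[0, 1])
print(f"|correlation| between first component and group: {corr_with_group:.3f}")
```

                  In practice the number of components to remove is itself a modelling
                  choice, and removing too many can discard the signal of interest along
                  with the unwanted variation.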
                      Data integration can be performed through N-integration (integration of
                  variables), which considers different omics platforms evaluated on the same
                  samples, or P-integration (integration of sample units), i.e.,
                  concatenation across studies on the same variables. Typical techniques for
                  database N-integration use multivariate projection-based methods, such as
                  low-rank models, that embed both the sample units and the features of the
                  data blocks into the same low-dimensional vector space (Lê Cao et al.,
                  2009; Tenenhaus et al., 2011, 2014; Ray et al., 2017). These
                  low-dimensional vectors enable effective data analytics, such as
                  clustering, visualization and missing value imputation. In addition, these
                  vectors are latent variables or scores possibly representing



                                                                     211 | ISI WSC 2019