Page 221 - Special Topic Session (STS) - Volume 4

STS582 Júlia M. P. S.



                  Proteogenomics: statistical issues in data integration and prediction
                                        Júlia M Pavan Soler
                          Statistics Department, University of São Paulo, São Paulo, SP

            Abstract
            Proteogenomics inaugurates a new phase of multi-omics research in
            molecular biology, seeking to integrate information from large genome,
            transcriptome and proteome datasets with clinical traits. The promise is to
            identify patient-specific biomarkers that can be used for prognosis in
            precision medicine. However, the expected contribution of this area depends
            on overcoming several interdisciplinary challenges, ranging from the design
            of experiments for sample preparation, storage and processing, through data
            integration, to analysis and interpretation. Brazil, like other countries, is
            starting to dedicate efforts to the proteogenomic analysis of many diseases in
            order to identify biomarkers that are specific to, or common among, world
            populations.
            Specifically, the Baependi Family Heart Study is one of the largest ongoing
            efforts for molecular mapping of cardiovascular diseases in our country, and
            it includes Brazilian family information. Statistical approaches in
            proteogenomics are typically formulated assuming unrelated individuals, and
            if family structure is present but ignored, such substructure may lead to
            misleading results. In this talk, in the context of proteogenomics, we will
            consider flexible methodologies for dimensionality reduction, variable
            selection and structure learning, taking into account sparsity, dependent
            observations and missing information.

            Keywords
            Matrix factorization; Varying coefficients; Multi-omics data; Family-based
            design; Complex data

            1.  Introduction
                A relevant issue that is becoming increasingly important in the age of
            big and complex data is data integration. An early version of that trend can
            be seen in multi-omics studies, as exemplified by proteogenomics studies,
            which seek to integrate many sources of information from large datasets
            measured on a common set of experimental subjects and to relate them to
            clinical traits. In general, the integration scope of these studies tries to
            cover the central dogma of molecular biology, including data from the genome
            (such as SNP and CNV platforms), epigenome (such as methylation data),
            transcriptome (such as microarray and RNA-seq data) and proteome (such as
            LC-MS/MS data) to the phenome (phenotype dataset). The Cancer Genome Atlas
            (TCGA, Weinstein et


                                                               210 | I S I   W S C   2 0 1 9