Page 221 - Special Topic Session (STS) - Volume 4
STS582 Júlia M. P. S.
Proteogenomics: statistical issues in data
integration and prediction
Júlia M Pavan Soler
Statistics Department, University of São Paulo, São Paulo, SP
Abstract
Proteogenomics inaugurates a new phase of multi-omics research in
Molecular Biology, seeking to integrate information from large genome,
transcriptome and proteome datasets and relate it to clinical traits. The
promise is to identify patient-specific biomarkers that can be used for
prognosis in precision medicine. However, the expected contribution of this
area depends on overcoming several interdisciplinary challenges, ranging
from the design of experiments for sample preparation, storage and
processing to the integration, analysis and interpretation of the data.
Brazil, like other countries, has begun to dedicate efforts to the
proteogenomic analysis of many diseases in order to identify both
population-specific and common biomarkers among world populations.
Specifically, the Baependi Family Heart Study is one of the largest ongoing
efforts for molecular mapping of cardiovascular diseases in Brazil, and it
includes data on Brazilian families. Statistical approaches in
proteogenomics are typically formulated assuming unrelated individuals, and
if family structure is present but ignored, such substructure may lead to
misleading results. In this talk, in the context of proteogenomics, we will
consider flexible methodologies for dimensionality reduction, variable
selection and structure learning that take into account sparsity, dependent
observations and missing information.
Keywords
Matrix factorization; Varying coefficients; Multi-omics data; Family-based
design; Complex data
1. Introduction
Data integration is an issue of growing importance in the age of big and
complex data. An early version of that trend can be seen in multi-omics
studies, as exemplified by proteogenomics studies, which seek to integrate
many sources of information from large datasets measured on a common set of
experimental subjects and relate them to clinical traits. In general, the
integration scope in these studies tries to cover the central dogma of
Molecular Biology, including data from the genome (such as SNP and CNV
platforms), epigenome (such as methylation data), transcriptome (such as
microarray and RNA-seq data) and proteome (such as LC-MS/MS data) through to
the phenome (phenotype dataset). The Cancer Genome Atlas (TCGA, Weinstein et
210 | ISI WSC 2019