Page 201 - Contributed Paper Session (CPS) - Volume 4

CPS2174 Septian R. et al.
            developed to use logistic regression in the model averaging process when
            the response variable is measured on a categorical scale [3].
                This research focuses on constructing candidate logistic regression
            models for predicting the class of the response variable. Each candidate
            model is constructed by selecting predictor variables at random and is used
            to predict the class of the response variable. The process is repeated several
            times to obtain several predictions, which are then averaged using specified
            weights. Here, the predictions of the response variable are averaged in
            probability form.
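
                The procedure described above can be sketched in a few lines of Python.
            This is a minimal illustration under stated assumptions, not the authors'
            implementation: the function names (`fit_logistic`, `model_average`), the
            gradient-descent fitting routine, the small ridge penalty, the equal default
            weights, the 0.5 classification cut-off, and the toy data are all hypothetical
            choices made only for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=200, l2=0.01):
    """Fit a logistic regression by batch gradient descent (illustrative only)."""
    n, m = len(X), len(X[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(m):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
    return w, b


def predict_proba(model, X):
    """Return the predicted probability of class 1 for each row of X."""
    w, b = model
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def model_average(X_train, y_train, X_new, n_models, n_predictors,
                  weights=None, seed=0):
    """Average class-1 probabilities over randomly drawn predictor subsets."""
    rng = random.Random(seed)
    p = len(X_train[0])
    if weights is None:
        # Assumption: equal weights; the paper only says "specified weights".
        weights = [1.0 / n_models] * n_models
    averaged = [0.0] * len(X_new)
    for k in range(n_models):
        cols = rng.sample(range(p), n_predictors)  # random predictor subset
        sub_train = [[row[j] for j in cols] for row in X_train]
        sub_new = [[row[j] for j in cols] for row in X_new]
        model = fit_logistic(sub_train, y_train)
        for i, prob in enumerate(predict_proba(model, sub_new)):
            averaged[i] += weights[k] * prob
    return averaged


# Toy data (hypothetical): 40 observations, 20 noisy predictors.
toy_rng = random.Random(1)
y_toy = [i % 2 for i in range(40)]
X_toy = [[yi + toy_rng.gauss(0, 0.3) for _ in range(20)] for yi in y_toy]

probs = model_average(X_toy, y_toy, X_toy, n_models=5, n_predictors=4)
classes = [1 if pr >= 0.5 else 0 for pr in probs]  # assumed 0.5 cut-off
```

                Averaging probabilities rather than hard class labels keeps the information
            about how confident each candidate model is; the class is assigned only after
            the weighted average has been formed.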
                The data used in this research are an RNA-seq data set that is part of a
            random extraction of gene expressions of patients having different types of
            tumor: KIRC (Kidney Renal Clear-Cell Carcinoma) and LUAD (Lung
            Adenocarcinoma). The data set contains 20532 genes measured on 287 patients
            [4]. Each candidate model uses 50, 100, or 150 selected genes, and the number
            of candidate models is 50. In practice, about 40% of the patients are set
            aside as testing data to evaluate the accuracy, sensitivity, and specificity
            of the prediction.
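
                As a hedged illustration of the evaluation step, the sketch below splits
            patient indices into training and testing parts and computes accuracy,
            sensitivity, and specificity from a confusion matrix. The function names, the
            random seed, the exact rounding of the 40% share, and the choice of class 1 as
            the positive class are assumptions made for this example; the label vectors
            are made-up toy values, not results from the paper.

```python
import random


def split_indices(n, test_fraction=0.4, seed=0):
    """Randomly split range(n) into (train, test) index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, and specificity for a two-class prediction."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        # Share of true positives that were predicted positive.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of true negatives that were predicted negative.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# 287 patients with roughly 40% held out for testing.
train_idx, test_idx = split_indices(287, test_fraction=0.4)

# Toy labels (hypothetical) to show the metric calculation.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```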

            2.  Methodology
                This section describes the data set used in this research, the model
            averaging concept in the logistic regression approach, and the evaluation of
            the prediction.

                2.1 Data
                The data set used in this research is part of The Cancer Genome
            Atlas (TCGA) Research, which profiles and analyzes large numbers of human
            tumors to discover molecular aberrations [4]. This research takes the subset
            of these data on patients having KIRC and LUAD, based on their RNA-seq. The
            response variable is therefore the class of each patient according to tumor
            type. The data contain n = 287 patients and p = 20532 genes as predictor
            variables, so they are high-dimensional data (p ≫ n). For data analysis, the
            class KIRC is coded as 1 and LUAD as 0.

                2.2 Model Averaging

                Let $X_{n \times p}$ be high-dimensional data with number of observations $n$
            and number of predictors $p$ ($p \gg n$), and let $X^{*}_{n \times m}$ be a subset of
            $X$ with number of predictors $m$ ($m < p$). Let $y_{n \times 1}$ be the response
            variable. Assume the regression model for the subset predictor data is
            $y = f(X^{*}) + \varepsilon$. The model averaging concept is to create several
            candidate models, each built on a subset of the predictors, and to combine
            them into a representative final model. The number of candidate models is $K$,
            and each model contains $m$ predictors.
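
                For orientation, one common way to write the averaged prediction is given
            below. This is a hedged sketch rather than the paper's own formula: $K$ denotes
            the number of candidate models, $x_i^{*(k)}$ the predictors of observation $i$
            used in candidate model $k$, and $w_k$ its weight. The symbols, the
            normalization of the weights, and the 0.5 cut-off are assumptions introduced
            here for illustration.

```latex
\begin{align}
  \hat{\pi}_i^{(k)} &= \frac{1}{1 + \exp\!\left\{-\left(\hat{\beta}_0^{(k)}
      + x_i^{*(k)\top} \hat{\beta}^{(k)}\right)\right\}},
      \qquad k = 1, \dots, K, \\
  \hat{\pi}_i &= \sum_{k=1}^{K} w_k \, \hat{\pi}_i^{(k)},
      \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1, \\
  \hat{y}_i &= \begin{cases} 1, & \hat{\pi}_i \ge 0.5, \\ 0, & \text{otherwise.} \end{cases}
\end{align}
```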



                                                               190 | I S I   W S C   2 0 1 9