Page 186 - Contributed Paper Session (CPS) - Volume 2
CPS1820 Shuichi S.
High-dimensional microarray data analysis
- First success of cancer gene analysis and cancer
gene diagnosis -
Shuichi Shinmura
Emeritus Professor, Seikei University
Abstract
From October 28 to December 20, 2015, we discriminated six microarrays by
Revised IP-OLDF (RIP) based on the minimum number of misclassifications
(Minimum NM, MNM) criterion. All six data sets are linearly separable data
(LSD, MNM=0). We call a linearly separable space and its linearly separable
subspaces Matryoshkas. LSD has a Matryoshka structure that contains smaller
Matryoshkas. We
developed a Matryoshka feature selection method (Method2) that could
decompose each microarray into many small Matryoshkas (SMs) and a noise
subspace (MNM>=1). Because all SMs are small samples, statistical methods
can analyze them. However, these methods cannot show that the SMs are
linearly separable. Thus, we build new signal data from RIP discriminant
scores (RipDSs) instead of genes. We regard RipDSs as malignancy indexes
for cancer gene diagnosis. In this
paper, we explain cancer gene diagnosis using Alon’s microarray, which
consists of 62 patients and 2,000 genes. Method2 decomposes Alon’s microarray
into 64 SMs. Thus, we build new signal data that consist of 62 patients and
64 RipDSs in place of the 2,000 high-dimensional genes. When we analyze the
signal data by Ward’s cluster analysis, the two classes form two clusters.
Moreover, the cancer class can be divided into more than two clusters. Next,
when we analyze the signal data by Principal Component Analysis (PCA), we
can examine several clusters that explain the new sub-class
of cancer pointed out by Golub et al. Many researchers could not solve the
problem of high-dimensional microarray analysis for several reasons. 1) They
did not find that the six microarrays are LSD (Fact3), although RIP and a
hard-margin SVM (H-SVM) find Fact3 easily. 2) Method2 with RIP decomposes
each microarray into many SMs and a noise subspace (Fact4), whereas Method2
with H-SVM cannot find SMs. Although all SMs are small samples, statistical
methods cannot show that they are linearly separable. Thus, we build signal
data from RipDSs instead of genes. Cluster analysis shows that the two
classes form two clear clusters, and PCA shows that the two classes are
separated on the first principal component (Prin1), which becomes another
malignancy index in addition to the many RipDSs. Our approach is beneficial
for cancer gene diagnosis by microarray.
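The construction of signal data from discriminant scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the RIP procedure or Method2 itself: a simple perceptron stands in for RIP, the candidate gene subsets are given by hand rather than found by Method2, and the toy data and the helper names `train_perceptron`, `discriminant_scores`, and `build_signal_data` are hypothetical.

```python
# Sketch only: a perceptron stands in for RIP, and the data and gene
# subsets below are invented for illustration.

def train_perceptron(X, y, max_epochs=1000):
    """Return (w, b) with zero training errors, or None if none is found."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified or on the boundary
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                errors += 1
        if errors == 0:  # number of misclassifications is 0 on this subset
            return w, b
    return None


def discriminant_scores(X, w, b):
    """Linear discriminant score w.x + b for every case."""
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]


def build_signal_data(X, y, subsets):
    """One column of discriminant scores per linearly separable gene subset."""
    columns = []
    for genes in subsets:
        sub = [[row[g] for g in genes] for row in X]
        model = train_perceptron(sub, y)
        if model is None:
            continue  # not separable here: treated as noise in this sketch
        columns.append(discriminant_scores(sub, *model))
    # Transpose so that rows are patients and columns are scores.
    return [list(row) for row in zip(*columns)]


# Toy data: 6 patients by 4 genes; labels +1 and -1 mark the two classes.
X = [
    [2.0, 1.0, 0.5, 3.0],
    [1.5, 2.0, 0.2, 2.5],
    [2.5, 1.5, 0.1, 3.5],
    [-1.0, -2.0, 0.4, -3.0],
    [-2.0, -1.0, 0.3, -2.0],
    [-1.5, -1.5, 0.6, -2.5],
]
y = [1, 1, 1, -1, -1, -1]

# Hypothetical gene subsets; gene 2 alone does not separate the classes.
candidate_subsets = [[0, 1], [3], [2]]
signal = build_signal_data(X, y, candidate_subsets)

for label, row in zip(y, signal):
    print(label, row)
```

In this toy example the two separable subsets each contribute one score column, and the sign of every score agrees with the class label, so each column separates the two classes on its own. The resulting patients-by-scores table is the kind of low-dimensional input that cluster analysis or PCA can then take in place of the original genes.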
Keywords
Linearly Separable Data (LSD); Small Matryoshka (SM); Revised IP-OLDF (RIP);
Malignancy Index; Matryoshka Feature Selection Method (Method2)
175 | ISI WSC 2019