Page 243 - Special Topic Session (STS) - Volume 1

STS426 Asis K.C.
            P64, P256, and P1024 are peak fluxes measured in 64, 256, and 1024 ms bins,
            respectively; T50 and T90 are the times within which 50% and 90% of the flux
            arrive. Fluence is given in ergs per square centimetre (ergs cm−2), peak flux
            in counts per square centimetre per second (cm−2 s−1), and time in seconds (s).
            First, observations on each variable are standardized because the ranges of
            the variables differ widely. Then, for a particular choice of kernel, KPCA is
            performed on the standardized data. They extract nonlinear features via the
            significant kernel principal components and, using these as study variables,
            perform k-means clustering (the Hartigan–Wong algorithm; Hartigan and Wong
            1979), with the number of clusters determined by the gap statistic
            (Tibshirani et al. 2001).
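A minimal sketch of this pipeline (standardize, extract kernel principal components, then cluster) can be written with scikit-learn. The RBF kernel, its gamma value, and the synthetic data below are placeholder assumptions; the paper uses its own kernel family indexed by p and s, applied to the GRB catalogue variables:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the GRB variables (fluence, peak fluxes, T50, T90).
X = np.vstack([rng.normal(0, 1, (100, 6)), rng.normal(4, 1, (100, 6))])

# Standardize each variable, since their ranges differ widely.
X_std = StandardScaler().fit_transform(X)

# Extract the first two kernel principal components (RBF kernel here,
# as an illustrative substitute for the paper's kernel).
kpcs = KernelPCA(n_components=2, kernel="rbf", gamma=0.1).fit_transform(X_std)

# Cluster the nonlinear features with k-means; in the paper the number
# of clusters comes from the gap statistic, here it is fixed to 2.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(kpcs)
print(np.bincount(labels))
```

Increasing `n_components` corresponds to the paper's strategy of adding KPCs as long as an accuracy measure keeps improving.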
                Kernel principal components carry less information and more noise with
            increasing order, and beyond a certain order they provide no relevant
            information about the data under study. They therefore start with the first
            few KPCs and increase the number of chosen KPCs as long as clustering
            performance improves in terms of an accuracy measure. For this purpose they
            use the Dunn index (Dunn 1974), an internal validation measure of a
            performed clustering that takes values between 0 and ∞, with larger values
            indicating better clustering.
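The Dunn index is the smallest inter-cluster distance divided by the largest intra-cluster diameter, so compact, well-separated clusters score high. A minimal implementation (names and test data are illustrative, not from the paper):

```python
import numpy as np

def dunn_index(X, labels):
    """Smallest inter-cluster distance divided by the largest
    intra-cluster diameter; larger values mean better clustering."""
    X = np.asarray(X, dtype=float)
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Largest diameter: maximum pairwise distance within any one cluster.
    diam = max(
        np.linalg.norm(c[:, None] - c[None, :], axis=-1).max()
        for c in clusters
    )
    # Smallest separation: minimum distance between points in different clusters.
    sep = min(
        np.linalg.norm(a[:, None] - b[None, :], axis=-1).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return sep / diam

# Two tight, well-separated clusters give a large Dunn index.
X = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
print(dunn_index(X, np.array([0, 0, 1, 1])))
```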

            3.  Results and interpretation
                First, the k-means clustering method is applied to the standardized
            variables of the GRB data set, for which the gap statistic indicates no
            clustering in the GRBs; i.e., the raw GRB data set fails to reveal the
            inherent clustering structure. The same method is then applied to the
            principal components extracted from the GRB data through principal
            component analysis. The linear features (the first two PCs), explaining
            more than 80% of the variation in the data, result in a single group of
            GRBs; thus linear information cannot expose the natural groups present in
            GRBs. A new choice of kernel successfully reveals the inherent clustering
            structure by extracting the relevant nonlinear information from the raw
            data in terms of kernel principal components. The first two KPCs, extracted
            through kernel (10) with p < 1 and for every choice of s considered, are
            enough to describe the data. In Chattopadhyay et al. (2007), the k-means
            clustering approach is applied directly to differently chosen study
            variables, and 1594 GRBs are clustered into three groups of sizes 622, 423,
            and 549, with a 4.08% 1-NN classification error rate. In contrast, their
            clustering of 1972 GRBs based on the first two kernel principal components,
            extracted by the proposed kernel with p = 1/2 and s = σ1, groups those 1594
            GRBs into three clusters of sizes 827, 438, and 329, with a 0.2% 1-NN
            classification error rate.
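The 1-NN classification error rate quoted above measures how cleanly the clusters separate: each point is classified by the cluster label of its nearest neighbour, and the fraction misclassified is the error. A leave-one-out sketch with scikit-learn, using synthetic two-dimensional features as a stand-in for the KPC scores:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
# Synthetic features and cluster labels standing in for the KPC scores.
X = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(3, 0.5, (60, 2))])
y = np.repeat([0, 1], 60)

# Leave-one-out 1-NN: classify each point by its single nearest neighbour
# among the remaining points; the mean error gauges cluster separation.
acc = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y,
                      cv=LeaveOneOut()).mean()
print(f"1-NN error rate: {1 - acc:.2%}")
```

A low error rate, such as the paper's 0.2%, means almost every GRB sits closer to members of its own cluster than to any other cluster.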
                Here they not only reduce the burden of the data but also extract the
            inherent information from it, on which a simple clustering method reveals
            the natural groups in GRBs. We propose a new possible way, kernel principal

                                                                232 | ISI WSC 2019