Page 249 - Special Topic Session (STS) - Volume 1

STS426 Didier Fraix-Burnet
            spectra of galaxies, the first study using k-means is recent (Sánchez Almeida
            et al. 2010) and has been disputed in De et al. (2016). Indeed, the data set is
            so large that this simple technique is not able to detect structures.
                This prompted us to use more sophisticated techniques to automatically
            and objectively build a statistical classification of galaxy spectra. Firstly, we
            want to use unsupervised clustering, since we consider supervised learning
            questionable at present: the training set is devised by human subjectivity.
            Data mining is a better way to find out what kinds of structures algorithms
            are able to detect in the data set. Secondly, the large number of parameters
            (wavelengths) very probably carries redundant information. Sánchez Almeida
            et al. (2010) selected a priori physically interesting regions of the spectra,
            but this introduces biases in the discriminative power of the classification.
            Dimensionality reduction techniques such as Principal Component Analysis
            have so far been used only to separate spectra of galaxies from those of
            stars. However, principal components are known to be inadequate for
            clustering (Chang 1983). As a consequence, we have chosen a discriminative
            latent subspace clustering approach designed both to reduce the
            dimensionality and to perform an unsupervised clustering, as described in
            the Methodology section.

            2.  Methodology
                The data consist of 702248 spectra of galaxies and quasars with redshift
            smaller than 0.25 that were retrieved from the Sloan Digital Sky Survey (SDSS)
            database, release 7 (http://www.sdss.org/dr7/). These data and their
            preparation are described in De et al. (2016), except that here we do not
            select wavelength bands to reduce the number of wavelengths, initially more
            than 3000, to around 1500. Rather, we applied wavelet filtering of the noise
            followed by binning by a factor of 2.
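
The binning step can be illustrated with a minimal sketch. It assumes that binning means averaging adjacent pairs of flux values and that a trailing unpaired sample is dropped; the text does not specify either choice. The wavelet filtering is not shown, and the function names and toy values are hypothetical.

```python
def bin_spectrum(flux, factor=2):
    """Average consecutive groups of `factor` values.

    Assumption: a trailing remainder shorter than `factor` is dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    n_bins = len(flux) // factor
    return [
        sum(flux[i * factor:(i + 1) * factor]) / factor
        for i in range(n_bins)
    ]


def bin_wavelengths(wavelengths, factor=2):
    """Bin the wavelength grid the same way so it stays aligned with the flux."""
    return bin_spectrum(wavelengths, factor)


# Toy spectrum with 6 wavelength samples (hypothetical values).
wavelengths = [4000.0, 4001.0, 4002.0, 4003.0, 4004.0, 4005.0]
flux = [1.0, 3.0, 2.0, 2.0, 5.0, 7.0]

binned_flux = bin_spectrum(flux)            # [2.0, 2.0, 6.0]
binned_wave = bin_wavelengths(wavelengths)  # [4000.5, 4002.5, 4004.5]
```

With a factor of 2, a spectrum of slightly more than 3000 samples comes out with about 1500 samples, which matches the reduction quoted above.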
                The unsupervised clustering was performed with the FisherEM algorithm
            available in R (Bouveyron & Brunet 2012). The Fisher-EM algorithm fits a
            discriminative latent mixture model, estimating both the discriminative
            subspace and the parameters of the mixture model. It is based on the
            Expectation-Maximization (EM) algorithm, in which an additional step,
            named the F-step, is introduced between the E- and M-steps. This F-step
            uses Fisher's criterion under orthonormality constraints and conditionally
            on the posterior probabilities to optimize the clustering.
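
For reference, the F-step can be written out as follows. This is a sketch in notation assumed here (the symbols are not defined in the text above), following the form in which Fisher's criterion is usually stated for Fisher-EM; Bouveyron & Brunet (2012) should be consulted for the exact definitions.

```latex
% F-step, in notation assumed here (not taken from the text above):
%   y_i    : observed spectrum i (p wavelengths), i = 1..n
%   t_{ik} : posterior probability that spectrum i belongs to cluster k (from the E-step)
%   U      : p x d matrix whose columns span the discriminative subspace
\[
  n_k = \sum_{i=1}^{n} t_{ik}, \qquad
  m_k = \frac{1}{n_k} \sum_{i=1}^{n} t_{ik}\, y_i, \qquad
  \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
\]
\[
  S = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^{\top}, \qquad
  S_B = \frac{1}{n} \sum_{k=1}^{K} n_k\, (m_k - \bar{y})(m_k - \bar{y})^{\top}
\]
\[
  \hat{U} = \arg\max_{U}\;
  \operatorname{tr}\!\left( (U^{\top} S\, U)^{-1}\, U^{\top} S_B\, U \right)
  \quad \text{subject to} \quad U^{\top} U = I_d
\]
```

Here the between-cluster scatter S_B is "soft": it is weighted by the posterior probabilities from the E-step, which is what makes the subspace estimate conditional on the current clustering.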
                Because of computational constraints, and because the algorithm is
            currently written in R, we analyzed several sets of 100 000 spectra, as well
            as one set of 300 000 spectra. The latter took two weeks of computation with
            the current non-parallelized R code.
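
One way to build such subsets is sketched below. It assumes that each subset is drawn uniformly at random without replacement and independently of the others, which the text does not state; the number of subsets, the seed and the function name are hypothetical.

```python
import random


def draw_subsets(n_total, subset_size, n_subsets, seed=0):
    """Return `n_subsets` sorted lists of row indices.

    Each subset is sampled without replacement; subsets are drawn
    independently of one another, so they may overlap.
    """
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_total), subset_size))
        for _ in range(n_subsets)
    ]


N_SPECTRA = 702248    # size of the full sample, from the text
SUBSET_SIZE = 100000  # size of each analysed subset, from the text
N_SUBSETS = 3         # hypothetical: the text only says "several"

subsets = draw_subsets(N_SPECTRA, SUBSET_SIZE, N_SUBSETS)
```

Fixing the seed keeps the subsets reproducible, so that clusterings obtained on different subsets can be compared across runs.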



                                                               238 | I S I   W S C   2 0 1 9