Page 249 - Special Topic Session (STS) - Volume 1

STS426 Didier Fraix-Burnet
spectra of galaxies, the first study using k-means is recent (Sánchez Almeida
et al. 2010) and has been disputed in De et al. (2016). Indeed, the data set is
so large that this simple technique is not able to detect structures.
This prompted us to use more sophisticated techniques to build a statistical
classification of galaxy spectra automatically and objectively. Firstly, we
want to use unsupervised clustering, since we consider supervised learning
questionable at present: the training set is shaped by human subjectivity.
Data mining is a better way to find out what kind of structures algorithms
are able to detect in the data set. Secondly, the large number of parameters
(wavelengths) is very probably redundant. Sánchez Almeida et al. (2010)
selected a priori physically interesting regions of the spectra, but this
biases the discriminative power of the classification. Dimensionality
reduction techniques such as Principal Component Analysis have so far been
used only to separate spectra of galaxies from those of stars. However,
principal components are known to be inadequate for clustering (Chang 1983).
As a consequence, we have chosen a discriminative latent subspace clustering
approach designed both to reduce the dimensionality and to perform an
unsupervised clustering, as described in the Methodology section.
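
The point about principal components can be illustrated with a minimal toy
sketch. It is not taken from this study: the two-dimensional data, the helper
names principal_axes_2d and separation, and the separation score are all
assumptions made for illustration. Two clusters differ only along y, while
most of the variance lies along x, so the first principal component carries
almost no cluster information.

```python
import math

# Toy data (hypothetical): two clusters that differ only along y (y near -1
# or +1), while most of the variance lies along x (x runs from -10 to 10).
points, labels = [], []
for label in (-1, 1):
    for i in range(21):
        jitter = 0.1 * ((i % 3) - 1)  # small deterministic scatter in y
        points.append((float(i - 10), label + jitter))
        labels.append(label)


def principal_axes_2d(pts):
    """Return the unit vectors of the first and second principal axes."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    var_x = sum((p[0] - mx) ** 2 for p in pts) / n
    var_y = sum((p[1] - my) ** 2 for p in pts) / n
    cov_xy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    # Closed-form orientation of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    first = (math.cos(theta), math.sin(theta))
    second = (-math.sin(theta), math.cos(theta))
    return first, second


def separation(pts, labs, axis):
    """Gap between the two cluster means along `axis`, in units of the
    pooled within-cluster standard deviation."""
    groups = {}
    for p, lab in zip(pts, labs):
        groups.setdefault(lab, []).append(p[0] * axis[0] + p[1] * axis[1])
    means = {lab: sum(v) / len(v) for lab, v in groups.items()}
    within = sum(
        (x - means[lab]) ** 2 for lab, v in groups.items() for x in v
    ) / len(pts)
    m = list(means.values())
    return abs(m[0] - m[1]) / math.sqrt(within)


pc1, pc2 = principal_axes_2d(points)
sep_pc1 = separation(points, labels, pc1)  # close to zero
sep_pc2 = separation(points, labels, pc2)  # large
print(f"PC1 = {pc1}, separation along PC1 = {sep_pc1:.4f}")
print(f"PC2 = {pc2}, separation along PC2 = {sep_pc2:.4f}")
```

In this toy case, keeping only the first principal component would discard
the single direction that distinguishes the clusters, which is why a subspace
chosen for discrimination rather than for variance is preferable.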

2. Methodology
The data consist of 702 248 spectra of galaxies and quasars with redshift
smaller than 0.25 that were retrieved from the Sloan Digital Sky Survey (SDSS)
database, release 7 (http://www.sdss.org/dr7/). These data and their
preparation are described in De et al. (2016), except that here we do not
select wavelength bands to reduce the number of wavelengths from more than
3000 initially to around 1500. Rather, we applied a wavelet filtering of the
noise followed by a binning by a factor of 2.
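
The binning step can be sketched as follows. This is a minimal illustration,
assuming that binning means averaging adjacent flux samples in groups of two;
the function name bin_spectrum and the toy values are hypothetical. The
wavelet filtering is not reproduced here, because the text does not specify
the wavelet family or the threshold.

```python
def bin_spectrum(flux, factor=2):
    """Average consecutive groups of `factor` samples.

    Assumption: an incomplete trailing group is dropped.
    """
    n_bins = len(flux) // factor
    return [
        sum(flux[i * factor:(i + 1) * factor]) / factor
        for i in range(n_bins)
    ]


# Toy spectrum with five samples: the last, unpaired sample is dropped.
toy_flux = [1.0, 3.0, 2.0, 4.0, 10.0]
binned_toy = bin_spectrum(toy_flux)

# A spectrum sampled at 3000 wavelengths (illustrative length) is reduced
# to 1500 values by a binning of factor 2.
binned_length = len(bin_spectrum([0.0] * 3000))
print(binned_toy, binned_length)
```

The same function applied to the wavelength array would give the central
wavelength of each bin.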
The unsupervised clustering was performed with the Fisher-EM algorithm,
available in R as the FisherEM package (Bouveyron & Brunet 2012). The
Fisher-EM algorithm fits a discriminative latent mixture model, estimating
both the discriminative subspace and the parameters of the mixture model. It
is based on the Expectation-Maximization (EM) algorithm, into which an
additional step, named the F-step, is introduced between the E- and M-steps.
This F-step uses Fisher's criterion under orthonormality constraints and
conditionally on the posterior probabilities to optimize the clustering.
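
For reference, the F-step criterion can be written out as below. This is a
sketch based on our reading of Bouveyron & Brunet (2012), and the notation is
introduced here as an assumption rather than taken from the text above: y_i
are the n observed spectra, t_ik is the posterior probability that spectrum i
belongs to cluster k (from the E-step), S is the covariance matrix of the
whole sample, S_B is the soft between-cluster covariance matrix, and U is the
matrix whose d orthonormal columns span the discriminative subspace.

```latex
\hat{U} = \arg\max_{U}\;
  \operatorname{trace}\!\left( \left(U^{\top} S\, U\right)^{-1} U^{\top} S_B\, U \right)
  \quad \text{subject to} \quad U^{\top} U = I_d ,
\qquad
S_B = \frac{1}{n} \sum_{k=1}^{K} n_k \,(m_k - \bar{y})(m_k - \bar{y})^{\top},
\quad
n_k = \sum_{i=1}^{n} t_{ik},
\quad
m_k = \frac{1}{n_k} \sum_{i=1}^{n} t_{ik}\, y_i .
```

Here \bar{y} is the mean of all spectra. The subspace is thus re-estimated at
each iteration from the current soft cluster memberships, before the M-step
updates the mixture parameters within that subspace.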
Due to computational constraints, and because the algorithm is currently
written in R, we analyzed several sets of 100 000 spectra, as well as one set
of 300 000 spectra. The latter took two weeks of computation with the current
non-parallelized R code.
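
Drawing such sets can be sketched as below. The text does not state how the
sets were selected, so uniform random sampling without replacement, the seeds,
and the function name draw_subset are all assumptions made for illustration.

```python
import random

N_SPECTRA = 702_248    # total number of spectra quoted in the text
SUBSET_SIZE = 100_000  # size of each analyzed set quoted in the text


def draw_subset(n_total, size, seed):
    """Return `size` distinct spectrum indices, sorted, drawn uniformly at
    random without replacement (assumed selection scheme)."""
    rng = random.Random(seed)  # seeded so that each set is reproducible
    return sorted(rng.sample(range(n_total), size))


# Three illustrative sets; the seeds are arbitrary.
subsets = [draw_subset(N_SPECTRA, SUBSET_SIZE, seed) for seed in (0, 1, 2)]
print([len(s) for s in subsets])
```

Sets drawn this way may overlap with one another; comparing the clusterings
obtained on different sets is one way to check their stability.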

238 | ISI WSC 2019