Page 335 - Contributed Paper Session (CPS) - Volume 6

CPS1950 Paolo G. et al.
                              Fuzzy clustering in a reduced subspace
                Paolo Giordani, Maria Brigida Ferraro, Mario Fordellone, Maurizio Vichi
                                  Sapienza University of Rome, Rome, Italy

            Abstract
            A general method for the two-mode simultaneous reduction of the observation
            units and variables of a data matrix is introduced. It is a compromise between
            the Reduced K-Means (RKM) and Factorial K-Means (FKM) procedures. Both
            methodologies combine principal component analysis for the variables with
            K-Means for the observation units. However, RKM aims at maximizing the
            between-cluster deviance without imposing any condition on the within-cluster
            deviance, whereas FKM aims at minimizing the within-cluster deviance without
            imposing any condition on the between-cluster one. It follows that RKM and
            FKM complement each other. In order to take advantage of both methods, a
            convex linear combination of the RKM and FKM loss functions is used.
            Furthermore, the fuzzy approach to clustering is adopted because of its
            flexibility in handling real-world complexity and uncertainty.
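            As a schematic illustration of the convex linear combination mentioned
            above (a sketch only: the actual RKM and FKM loss functions are defined in
            the paper's methodological section, and the function and parameter names
            below are hypothetical):

```python
def combined_loss(alpha, rkm_loss, fkm_loss):
    """Convex linear combination of two loss values.

    alpha in [0, 1]: alpha = 1 recovers the RKM objective,
    alpha = 0 recovers the FKM objective.
    """
    assert 0.0 <= alpha <= 1.0
    return alpha * rkm_loss + (1.0 - alpha) * fkm_loss
```

            For instance, with alpha = 0.5 the two objectives are weighted equally,
            while intermediate values trade off the between-cluster and within-cluster
            criteria.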

            Keywords
            Subspace clustering; Factorial K-Means; Reduced K-Means; Linear convex
            combination; Fuzzy approach to clustering

            1.  Introduction
                Clustering is the process of discovering a partition of I observation units
            into a limited number K (< I) of groups or clusters such that observation units
            belonging to the same cluster are similar according to a certain criterion. When
            J quantitative variables are observed on the set of observation units, the most
            common choice is to use the (squared) Euclidean distance to compute the
            dissimilarities between pairs of observation units. Probably the most famous
            clustering algorithm involving the squared Euclidean distance is the K-Means
            (KM) algorithm (MacQueen, 1967). It provides a partition of the observation
            units into K clusters, summarized by K centroids, so as to minimize the
            within-cluster sum of squares.
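                As a minimal illustration (not the method proposed in this paper), the
            KM iteration and the within-cluster sum of squares it minimizes can be
            sketched in pure Python; the deterministic "first k points" initialisation
            below is a simplification of standard practice:

```python
def kmeans(points, k, iters=100):
    """Lloyd-style K-Means on a list of coordinate tuples."""
    # naive initialisation: take the first k points as centroids
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        # assignment step: each unit joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(tuple(p))
        # update step: each centroid becomes the mean of its cluster
        new = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, clusters

def within_ss(centroids, clusters):
    """Within-cluster sum of squares, the quantity KM minimizes."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, c))
               for c, cl in zip(centroids, clusters) for p in cl)
```

            Running the sketch on two well-separated groups of units recovers one
            cluster per group, with a small within-cluster sum of squares.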
                The Euclidean distance is usually evaluated by considering all J variables.
            This is inadequate when there exists a subset of variables that does not play
            a relevant role in properly recovering the cluster structure and, in fact, tends
            to mask it. To overcome this problem, several strategies can be adopted. Roughly
            speaking, they consist in reweighting the variables so as to increase
            or decrease their role in the clustering process. For every variable, the higher


                                                               324 | ISI WSC 2019