Page 335 - Contributed Paper Session (CPS) - Volume 6

CPS1950 Paolo G. et al.
                              Fuzzy clustering in a reduced subspace
                Paolo Giordani, Maria Brigida Ferraro, Mario Fordellone, Maurizio Vichi
                                  Sapienza University of Rome, Rome, Italy

            Abstract
            A general method for the two-mode simultaneous reduction of the observation
            units and variables of a data matrix is introduced. It is a compromise between
            the Reduced K-Means (RKM) and Factorial K-Means (FKM) procedures. Both
            methodologies combine principal component analysis for the variables with
            K-Means for the observation units. However, RKM aims at maximizing the
            between-cluster deviance without imposing any condition on the within-cluster
            deviance, whereas FKM aims at minimizing the within-cluster deviance without
            imposing any condition on the between-cluster one. It follows that RKM and
            FKM complement each other. In order to take advantage of both methods, a
            convex linear combination of the RKM and FKM loss functions is used.
            Furthermore, the fuzzy approach to clustering is adopted because of its
            flexibility in handling real-world complexity and uncertainty.
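            As a schematic illustration of the convex linear combination mentioned
            above (a sketch only: the actual RKM and FKM loss functions are defined in
            the paper's methodological section, and the function and parameter names
            below are hypothetical):

```python
def combined_loss(alpha, rkm_loss, fkm_loss):
    """Convex linear combination of two loss values.

    alpha in [0, 1]: alpha = 1 recovers the RKM objective,
    alpha = 0 recovers the FKM objective.
    """
    assert 0.0 <= alpha <= 1.0
    return alpha * rkm_loss + (1.0 - alpha) * fkm_loss
```

            For instance, with alpha = 0.5 the two objectives are weighted equally,
            while intermediate values trade off the between-cluster and within-cluster
            criteria.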

            Keywords
            Subspace clustering; Factorial K-Means; Reduced K-Means; Linear convex
            combination; Fuzzy approach to clustering

            1.  Introduction
                Clustering is the process of discovering a partition of I observation units
            into a limited number K (< I) of groups or clusters such that observation units
            belonging to the same cluster are similar according to a certain criterion. When
            J quantitative variables are observed on the set of observation units, the most
            common choice is to use the (squared) Euclidean distance to compute the
            dissimilarities between pairs of observation units. Probably the most famous
            clustering algorithm involving the squared Euclidean distance is the K-Means
            (KM) algorithm (MacQueen, 1967). It provides a partition of the observation
            units into K clusters, summarized by K centroids, so as to minimize the
            within-cluster sum of squares.
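                As a minimal illustration (not the method proposed in this paper), the
            KM iteration and the within-cluster sum of squares it minimizes can be
            sketched in pure Python; the deterministic "first k points" initialisation
            below is a simplification of standard practice:

```python
def kmeans(points, k, iters=100):
    """Lloyd-style K-Means on a list of coordinate tuples."""
    # naive initialisation: take the first k points as centroids
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        # assignment step: each unit joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(tuple(p))
        # update step: each centroid becomes the mean of its cluster
        new = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, clusters

def within_ss(centroids, clusters):
    """Within-cluster sum of squares, the quantity KM minimizes."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, c))
               for c, cl in zip(centroids, clusters) for p in cl)
```

            Running the sketch on two well-separated groups of units recovers one
            cluster per group, with a small within-cluster sum of squares.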
                The Euclidean distance is usually evaluated by considering all J variables.
            This is inadequate when there exists a subset of variables that does not play
            a relevant role in properly recovering the cluster structure and, in fact, tends
            to mask it. To overcome this problem, several strategies can be adopted. Roughly
            speaking, they consist in reweighting the variables so as to increase
            or decrease their role in the clustering process. For every variable, the higher


                                                               324 | ISI WSC 2019