Page 206 - Contributed Paper Session (CPS) - Volume 4

CPS2182 Lynne Billard et al.





                                      Clustering of Interval-valued data
                                             Lynne Billard¹, Fei Liu²
                                               ¹ University of Georgia
                                                ² Bank of America

                  Abstract
                  The concept of symbolic data originates in Diday (1987). We consider cluster
                  methodology  for  intervals.  While  there  has  been  a  lot  of  activity  in  using
                  regression based algorithms to partition a data set into clusters for classical
                  data, no such algorithms have been developed for a set of interval-valued
                  observations. A new algorithm is proposed based on the k-means algorithm
                  of MacQueen (1967) and the dynamical partitioning method of Diday (1973)
                  and Diday and Simon (1976),  with the partitioning criteria  being based on
                  establishing regression models for each sub-cluster.

                  Keywords
                  Partitions; Regressions

                  1.  Introduction
                      With the advent of the modern computer, there has been an explosion in
                  the size of data sets across all scientific arenas. Analyses of such data sets
                  usually require aggregation in some form driven by the scientific questions
                  underlying these analyses. The aggregation perforce produces symbolic data
                  (such as lists, intervals, histograms, and the like) describing the observations
                  within  each  aggregated  class.  Thus,  instead  of  points  as  for  classical
                  observations, observations are now hypercubes or Cartesian products of
                  distributions, in p-dimensional space. Such data were originally introduced by
                  Diday (1987). We consider a dynamic partition of interval data using regression
                  criteria, in Section 2. After briefly describing the basics (in Section 2.1), the k-
                  means  and  k-regressions  algorithms  are  compared  in  Section  2.2.  The
                  performance of the k-regressions algorithm is then studied on different data
                  set structures, in Section 2.3. We conclude in Section 3.
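
To make the aggregation step concrete, the following minimal sketch shows one way classical point observations might be collapsed into interval-valued observations, one hyperrectangle per aggregated class. Taking the minimum and maximum of each variable within a class is an assumption made here for illustration; the function name `aggregate_to_intervals` and the record layout are hypothetical and do not come from the paper.

```python
def aggregate_to_intervals(records):
    """Collapse classical points into one interval per variable per class.

    `records` is an iterable of (class_label, values) pairs, where `values`
    is a tuple of p numbers. The result maps each class label to a list of
    p (low, high) pairs, i.e. a hyperrectangle in p-dimensional space.
    Assumption: an interval is taken as [min, max] of the class members.
    """
    groups = {}
    for label, values in records:
        groups.setdefault(label, []).append(values)
    return {
        label: [(min(column), max(column)) for column in zip(*rows)]
        for label, rows in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical records: (class label, (variable 1, variable 2)).
    sample = [("a", (1.0, 10.0)), ("a", (3.0, 8.0)), ("b", (5.0, 2.0))]
    print(aggregate_to_intervals(sample))
```

A class with a single member yields degenerate intervals with equal endpoints, which is how a classical point appears as a special case of an interval-valued observation.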

                  2.  Regression-based Partitions
                      2.1  Basics
                      The k-means algorithm was first introduced by MacQueen (1967). Charles
                  (1977)  extended  the  dynamical  algorithm  of  Diday  (1973)  and  Diday  and
                  Simon (1976) to build a regression-based algorithm for classical point data.
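
The general idea of a regression-based partition for classical point data can be sketched as follows. This is an illustration only, not the authors' algorithm and not necessarily Charles's (1977) exact procedure. It assumes a single predictor, ordinary least squares lines as cluster representatives, a random initial partition, and allocation of each point to the line with the smallest squared residual. The names `fit_line` and `k_regressions` are hypothetical.

```python
import random


def fit_line(points):
    """Ordinary least squares fit y = intercept + slope * x for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx > 0 else 0.0  # flat line if all x coincide
    return mean_y - slope * mean_x, slope


def k_regressions(points, k, max_iter=100, seed=0):
    """Alternate between fitting one line per cluster and reallocating points.

    Returns (labels, lines), where lines[c] is the (intercept, slope) of
    cluster c. Assumption: the initial partition is drawn at random.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]
    lines = [(0.0, 0.0)] * k
    for _ in range(max_iter):
        # Representation step: refit the regression line of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if len(members) >= 2:
                lines[c] = fit_line(members)
            # Clusters with fewer than two points keep their previous line.
        # Allocation step: move each point to its best-fitting line.
        new_labels = [
            min(range(k), key=lambda c: (y - lines[c][0] - lines[c][1] * x) ** 2)
            for x, y in points
        ]
        if new_labels == labels:  # partition is stable, so stop
            break
        labels = new_labels
    return labels, lines


if __name__ == "__main__":
    # Hypothetical data: points drawn from two different straight lines.
    data = [(float(x), 2.0 * x + 1.0) for x in range(6)]
    data += [(float(x), -1.0 * x + 20.0) for x in range(6)]
    print(k_regressions(data, k=2))
```

As with k-means, a scheme of this kind can stop at a local optimum that depends on the initial partition, so in practice several random starts may be compared.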




                                                                     195 | ISI WSC 2019