Page 206 - Contributed Paper Session (CPS) - Volume 4
CPS2182 Lynne Billard et al.
Clustering of Interval-valued data
Lynne Billard¹, Fei Liu²
¹ University of Georgia
² Bank of America
Abstract
The concept of symbolic data originates in Diday (1987). We consider clustering
methodology for interval-valued data. While regression-based algorithms for
partitioning a classical data set into clusters have received considerable
attention, no such algorithms have been developed for interval-valued
observations. A new algorithm is proposed based on the k-means algorithm
of MacQueen (1967) and the dynamical partitioning method of Diday (1973)
and Diday and Simon (1976), with the partitioning criteria being based on
establishing regression models for each sub-cluster.
Keywords
Partitions; Regressions
1. Introduction
With the advent of the modern computer, there has been an explosion in
the size of data sets across all scientific arenas. Analyses of such data sets
usually require aggregation in some form driven by the scientific questions
underlying these analyses. The aggregation perforce produces symbolic data
(such as lists, intervals, histograms, and the like) describing the observations
within each aggregated class. Thus, instead of points as for classical
observations, observations are now hypercubes or Cartesian products of
distributions in p-dimensional space. Such data were originally introduced by
Diday (1987). In Section 2, we consider a dynamic partition of interval data
using regression criteria. After briefly describing the basics in Section 2.1,
we compare the k-means and k-regressions algorithms in Section 2.2. The
performance of the k-regressions algorithm on different data set structures is
then studied in Section 2.3. We conclude in Section 3.
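To make the aggregation step described above concrete, the following minimal sketch shows one common way classical points can be turned into interval-valued observations: each class is summarized by the [min, max] range of every variable, giving one hypercube per class. The function name `aggregate_to_intervals` and the toy records are hypothetical illustrations, not taken from the paper, and min/max aggregation is only one of several possible aggregation rules.

```python
from collections import defaultdict


def aggregate_to_intervals(records):
    """Aggregate classical points into one interval per variable per class.

    records: iterable of (class_label, point), where point is a tuple of
    p numbers. Returns {class_label: [(min_1, max_1), ..., (min_p, max_p)]}.
    This is an illustrative helper, not a function from the paper.
    """
    groups = defaultdict(list)
    for label, point in records:
        groups[label].append(point)

    intervals = {}
    for label, points in groups.items():
        # zip(*points) transposes the points into per-variable columns.
        intervals[label] = [(min(col), max(col)) for col in zip(*points)]
    return intervals


# Hypothetical toy data: two classes observed on p = 2 variables.
records = [
    ("A", (1.0, 10.0)),
    ("A", (3.0, 14.0)),
    ("B", (5.0, 20.0)),
    ("B", (9.0, 22.0)),
    ("B", (7.0, 18.0)),
]
boxes = aggregate_to_intervals(records)
```

Each value in `boxes` is a list of p intervals, that is, a hypercube in p-dimensional space rather than a single point.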
2. Regression-based Partitions
2.1 Basics
The k-means algorithm was first introduced by MacQueen (1967). Charles
(1977) extended the dynamical algorithm of Diday (1973) and Diday and
Simon (1976) to build a regression-based algorithm for classical point data.
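The general idea of a regression-based partition for classical point data can be sketched as an alternation between two steps: fit one regression per cluster, then reallocate each point to the cluster whose regression fits it best. The code below is a minimal sketch under stated assumptions (a single predictor, ordinary least squares, allocation by squared residual, a user-supplied initial partition). The function names are hypothetical, and this is neither the exact algorithm of Charles (1977) nor the interval-valued method proposed in this paper.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx > 0 else 0.0  # flat line if x has no spread
    return mean_y - slope * mean_x, slope


def squared_residual(line, point):
    """Squared vertical distance from point to line; inf if no line yet."""
    if line is None:
        return float("inf")
    a, b = line
    x, y = point
    return (y - (a + b * x)) ** 2


def k_regressions(points, labels, k, max_iter=100):
    """Alternate between fitting one line per cluster and reassigning points."""
    labels = list(labels)
    lines = [None] * k
    for _ in range(max_iter):
        # Representation step: refit the regression line of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous line
                lines[c] = fit_line(members)
        # Allocation step: move each point to its best-fitting line.
        new_labels = [
            min(range(k), key=lambda c: squared_residual(lines[c], p))
            for p in points
        ]
        if new_labels == labels:  # partition is stable, so stop
            break
        labels = new_labels
    return labels, lines


# Hypothetical toy data: five points on y = 2x, then five on y = 10 - x.
points = [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8),
          (0, 10), (1, 9), (2, 8), (3, 7), (4, 6)]
# Initial partition with the point (4, 8) deliberately misallocated.
initial = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
labels, lines = k_regressions(points, initial, k=2)
```

On this toy data the misallocated point moves back to the first cluster after one pass, and the two fitted lines recover the generating relationships. Unlike k-means, the allocation criterion is the residual from a fitted regression rather than the distance to a cluster mean.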
195 | ISI WSC 2019