Page 231 - Special Topic Session (STS) - Volume 1

STS426 Tanuka C.
                where $\|x\| = \sqrt{x^{\top}x}$ is the norm of $x$, $(x_1, \ldots, x_n)$ are the data points and
                $(\mu_1, \ldots, \mu_K)$ are the cluster centers for the $K$ clusters $C_1, \ldots, C_K$.
                The quantity $\sum_{j=1}^{K} \sum_{x_i \in C_j} \|x_i - \mu_j\|^2$ sums the squared Euclidean
                distance between each data point $x_i$ and the cluster center $\mu_j$ of the $j$th cluster
                $C_j$ to which it belongs. It is basically an indicator
                of the distance of the data points from their respective cluster centers.
                Finally, the algorithm has the following steps:
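As a rough illustration of how an iterative relocation scheme can reduce the sum of squared distances described above, the following is a minimal sketch of the standard Lloyd-style K-Means iteration. It is not taken from the paper: the function name `kmeans`, the random initialization from data points, the iteration cap and the stopping rule are all assumptions made for this example.

```python
import numpy as np


def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd-style K-Means sketch; returns (centers, labels, objective)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from K distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance of each point to each center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(K):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Objective: sum of squared distances of points to their own cluster center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    objective = d2[np.arange(len(X)), labels].sum()
    return centers, labels, objective


# Hypothetical toy data: two well-separated spherical groups in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
centers, labels, J = kmeans(X, K=2)
print(centers.round(2), J.round(2))
```

The sketch only finds a local minimum of the objective, so in practice several random starts are often compared.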

               3.2 The Mixture-Model Based Clustering Technique
                   K-Means clustering is an iterative relocation method that minimizes
               the intra-cluster variance. Model Based Clustering (MBC) is also an iterative
               method but, unlike K-Means, it makes provision for the variability and
               structure of the data. In finite mixture model based clustering, each
               component probability distribution corresponds to a cluster. The usual
               questions in applied cluster analysis, i.e., the choice of an appropriate clustering
               method and the determination of the number of clusters, can be reformulated as a
               statistical model selection problem in which models that differ in the number of
               components and/or in the component distributions can be compared. Outliers
               can also be conveniently modeled by adding one or more component(s)
               representing a different distribution for the outlying data (?).
                   As already noted, K-Means assumes homogeneous and spherical
               groups/clusters. It can be viewed as a procedure that approximately
               maximizes the multivariate normal classification likelihood when the
               covariance matrix is the same for every mixture component
               and is proportional to the identity matrix. On the other hand,
               MBC can handle overlapping and non-spherical clusters
               having different covariance structures (?).
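The link between K-Means and the spherical, equal-covariance normal model can be made explicit with a short derivation. This is a sketch in assumed notation that the text does not fix at this point: $z_i$ denotes the cluster label of the data point $x_i$, $\mu_k$ the mean of component $k$, $\sigma^2$ the common variance, $d$ the dimension, $n$ the number of points and $G$ the number of components.

```latex
\begin{align*}
\log L_C(\mu_1,\dots,\mu_G,\sigma^2; z_1,\dots,z_n)
  &= \sum_{i=1}^{n} \log \phi\!\left(x_i \mid \mu_{z_i}, \sigma^2 I_d\right) \\
  &= -\frac{nd}{2}\log\!\left(2\pi\sigma^2\right)
     - \frac{1}{2\sigma^2}\sum_{k=1}^{G}\ \sum_{i:\,z_i=k}\lVert x_i-\mu_k\rVert^2 .
\end{align*}
```

For any fixed $\sigma^2 > 0$, the first term does not depend on the labels or the means, so maximizing this classification log-likelihood over $z_1,\dots,z_n$ and $\mu_1,\dots,\mu_G$ amounts to minimizing the within-cluster sum of squared distances, which is the K-Means criterion.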
                   Suppose we have the data $x = \{x_1, \ldots, x_n\}$, where each $x_i$ is a d-dimensional
               vector. Now, for a given number of components $G$, assume the
               points are generated in an i.i.d. (independently and identically distributed)
               manner from the finite-mixture model:

                       $$f(x_i \mid \Theta) = \sum_{k=1}^{G} \pi_k f_k(x_i \mid \theta_k) \qquad (4)$$

               where $f_k(x_i \mid \theta_k)$ represents the density of the $k$th group/cluster
               parameterized by $\theta_k$, and $\pi_k := \Pr(x_i \in C_k)$ is called the mixing
               proportion/weight, where $\sum_{k=1}^{G} \pi_k = 1$. The complete set of parameters for
               a mixture-model with G components is:

                       $$\Theta = \{\pi_1, \ldots, \pi_G, \theta_1, \ldots, \theta_G\}$$
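To make the finite-mixture model in (4) concrete, here is a minimal NumPy sketch that evaluates the mixture density and draws i.i.d. points from it. It assumes Gaussian component densities, and the function names (`gaussian_pdf`, `mixture_pdf`, `sample_mixture`) and the example parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Multivariate normal density at each row of x (shape (n, d))."""
    d = mean.shape[0]
    diff = x - mean
    # Mahalanobis terms via a linear solve instead of an explicit inverse.
    maha = np.einsum("ij,ji->i", diff, np.linalg.solve(cov, diff.T))
    log_norm = 0.5 * (d * np.log(2.0 * np.pi) + np.linalg.slogdet(cov)[1])
    return np.exp(-0.5 * maha - log_norm)


def mixture_pdf(x, weights, means, covs):
    """Weighted sum of component densities, as in equation (4)."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("mixing proportions must sum to 1")
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs))


def sample_mixture(n, weights, means, covs, seed=0):
    """Draw n i.i.d. points: pick a component by its weight, then sample from it."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(weights), size=n, p=weights)
    x = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in labels])
    return x, labels


# Hypothetical two-component example with different covariance structures.
weights = [0.3, 0.7]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
points, labels = sample_mixture(2000, weights, means, covs)
density = mixture_pdf(points, weights, means, covs)
print(points.shape, density[:3])
```

The second component uses a non-spherical covariance matrix, which is the kind of structure a K-Means style criterion does not represent.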
                                                              
                                                    
                                                       
                                              
                   Most often, and throughout the rest of this work, $f_k$ is taken to be the
               Multivariate Normal (Gaussian) distribution $\phi(x_i \mid \mu_k, \Sigma_k)$, parameterized by
                                                               220 | I S I   W S C   2 0 1 9