Page 348 - Invited Paper Session (IPS) - Volume 2

IPS273 Charlotte L. et al.
$$\min_{\gamma \in \Pi(\mu,\nu)} \langle M, \gamma \rangle - \lambda E(\gamma), \qquad (1)$$

where $\lambda \ge 0$ is a regularization parameter and $E(\gamma)$ denotes the entropy of the coupling $\gamma$. This regularization allows one to obtain smoother and more numerically stable solutions than in the unregularized case. Indeed, Sinkhorn's theorem (Sinkhorn and Knopp, 1967) tells us that for $\lambda > 0$, (1) has a unique solution that can be obtained by left and right scaling of the Gibbs kernel $e^{-M/\lambda}$ (where the exponential is taken element-wise) to the prescribed sums of the admissible couplings:

$$\gamma^{\star}(\mu,\nu) := \operatorname*{arg\,min}_{\gamma \in \Pi(\mu,\nu)} \langle M, \gamma \rangle - \lambda E(\gamma) = \operatorname{diag}(u)\, e^{-M/\lambda}\, \operatorname{diag}(v),$$

where $u$ and $v$ are two non-negative scaling vectors, uniquely defined up to a multiplicative factor, which can be computed efficiently using Sinkhorn's algorithm.
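As a concrete illustration, Sinkhorn's algorithm alternates the two scaling updates $u \leftarrow \mu / (Kv)$ and $v \leftarrow \nu / (K^{\top}u)$ until the marginals of $\operatorname{diag}(u)\,K\,\operatorname{diag}(v)$ match $\mu$ and $\nu$. The following is a minimal NumPy sketch of this scheme (function name, tolerance, and toy data are our own, not from the paper):

```python
import numpy as np

def sinkhorn(mu, nu, M, lam, n_iter=1000, tol=1e-9):
    """Entropic OT via Sinkhorn scaling of the Gibbs kernel K = exp(-M/lam).

    Returns the scaling vectors u, v and the coupling
    gamma = diag(u) @ K @ diag(v).
    """
    K = np.exp(-M / lam)                 # element-wise Gibbs kernel
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iter):
        u = mu / (K @ v)                 # enforce row marginals
        v = nu / (K.T @ u)               # enforce column marginals
        # stop when the row marginals of the current coupling match mu
        if np.max(np.abs(u * (K @ v) - mu)) < tol:
            break
    gamma = u[:, None] * K * v[None, :]  # diag(u) K diag(v)
    return u, v, gamma

# toy problem: uniform marginals, random cost
rng = np.random.default_rng(0)
mu = np.full(4, 1 / 4)
nu = np.full(5, 1 / 5)
M = rng.random((4, 5))
u, v, gamma = sinkhorn(mu, nu, M, lam=1.0)
# gamma is an admissible coupling: its marginals recover mu and nu
assert np.allclose(gamma.sum(axis=1), mu, atol=1e-6)
assert np.allclose(gamma.sum(axis=0), nu, atol=1e-6)
```

Note that rescaling $u$ by a constant $c$ and $v$ by $1/c$ leaves the coupling unchanged, which is the multiplicative-factor indeterminacy mentioned above.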

                  3.  Our Contributions
                      3.1 Co-clustering through Optimal Transport
Let us denote by $X$ and $Y$ two random variables taking values in the sets $\{x_i\}_{i=1}^{m}$ and $\{y_j\}_{j=1}^{n}$, where $x_i$ and $y_j$ correspond to the $i$-th row (instance) and the $j$-th column (feature), respectively. We further assume that the joint probability distribution between $X$ and $Y$, denoted by $p(X, Y)$, is estimated from the data matrix $A$. The problem of co-clustering consists in jointly grouping the set of features and the set of instances into homogeneous blocks by finding two assignment functions $C_r$ and $C_c$ that map as follows: $C_r: \{x_1, \dots, x_m\} \to \{\hat{x}_1, \dots, \hat{x}_g\}$ and $C_c: \{y_1, \dots, y_n\} \to \{\hat{y}_1, \dots, \hat{y}_k\}$, where $g$ and $k$ denote the number of row and column clusters, and the discrete random variables $\hat{X}$ and $\hat{Y}$ represent the partitions induced by $X$ and $Y$, i.e., $\hat{X} = C_r(X)$ and $\hat{Y} = C_c(Y)$.
                       
The main idea is to consider the problem of finding a coupling between the rows and columns of a given data matrix $A$, represented by uniform sums of $m$ and $n$ Diracs defined over sets of points in $\mathbb{R}^{n}$ and $\mathbb{R}^{m}$, respectively. Formally, these row and column empirical measures can be defined as:

$$\mu := \sum_{i=1}^{m} \delta_{x_i}/m \quad \text{and} \quad \nu := \sum_{j=1}^{n} \delta_{y_j}/n.$$
For a given cost matrix $M$, where $M_{i,j}$ stands for the distance between row $i$ and column $j$, the authors obtain an optimal coupling $\gamma^{\star}(\mu, \nu)$ using Sinkhorn's algorithm, given by:

$$\gamma^{\star}(\mu, \nu) = \operatorname{diag}(u)\, e^{-M/\lambda}\, \operatorname{diag}(v).$$
The obtained solution can be seen as a joint probability distribution, implying that the scaling vectors $u$ and $v$ can be interpreted as estimated rows
335 | ISI WSC 2019