Page 348 - Invited Paper Session (IPS) - Volume 2
P. 348
IPS273 Charlotte L. et al.
min 〈M, 〉 − λ(), (1)
∈Π(,)
where λ ≥ 0 is a regularization parameter. This regularization allows to
obtain smoother and more numerically stable solutions compared to the
original case. Indeed, Sinkhorn’s theorem Sinkhorn and Knopp (1967) tells us
that for λ > 0, (1) has a unique solution that can be obtained by left and right
scaling of the Gibbs kernel −/ (where the exponential is taken element-
wise) to the prescribed sums of the admissible couplings:
⋆
(, ) ≔ arg min 〈, 〉 − λ() = diag() −/ diag(),
∈Π(,)
where u and v are two non-negative scaling vectors uniquely defined up
to a multiplicative factor, that can be efficiently computed using Sinkhorn’s
algorithm.
3. Our Contributions
3.1 Co-clustering through Optimal Transport
Let us denote by X and Y two random variables taking values in the
ℎ
sets { } and {} , where , correspond to row (instance) and
=1
=1
column (feature), respectively. We further assume that the joint probability
distribution between X and Y denoted by (, ) is estimated from the
data matrix A. The problem of co-clustering consists in jointly grouping
the set of features and the set of instances into homogeneous blocks by
finding two assignment functions and that map as follows: ∶
{ , … , } → {̂ , … , ̂ }, ∶ { , … , ̂ } where g and k denote the
1
1
1
̂
number of row and columns clusters, and discrete random variables and
̂
̂
̂
represent the partitions induced by X and Y , i.e., = () = and =
().
The main idea is to consider the problem of finding a coupling
between the rows and columns of a given data matrix A, represented by a
uniform sum of m and n Diracs defined over a set of points in ℝ and ℝ
respectively. Formally, these rows and columns empirical measures can be
defined as:
∶= ∑ / and ∶= ∑ /.
=1 =1
For a given cost matrix M where stands for the distance between
,
⋆
line and column , the authors obtain an optimal coupling ( , )
using Sinkhorn given by:
⋆
( , ) = diag() −/ diag().
The obtained solution can be seen as a joint probability distribution
implying that the scaling vectors u and v can be seen as estimated rows
335 | I S I W S C 2 0 1 9