Page 348 - Invited Paper Session (IPS)

Page 348 - Invited Paper Session (IPS) - Volume 2

P. 348

IPS273 Charlotte L. et al.
min 〈M, 〉 − λ(), (1)
∈Π(,)

where λ ≥ 0 is a regularization parameter. This regularization allows to
obtain smoother and more numerically stable solutions compared to the
original case. Indeed, Sinkhorn’s theorem Sinkhorn and Knopp (1967) tells us
that for λ > 0, (1) has a unique solution that can be obtained by left and right
scaling of the Gibbs kernel −/ (where the exponential is taken element-
wise) to the prescribed sums of the admissible couplings:

⋆
(, ) ≔ arg min 〈, 〉 − λ() = diag() −/ diag(),

∈Π(,)

where u and v are two non-negative scaling vectors uniquely defined up
to a multiplicative factor, that can be efficiently computed using Sinkhorn’s
algorithm.

3. Our Contributions
3.1 Co-clustering through Optimal Transport
Let us denote by X and Y two random variables taking values in the
ℎ
sets { } and {} , where , correspond to row (instance) and

=1
=1
column (feature), respectively. We further assume that the joint probability
distribution between X and Y denoted by (, ) is estimated from the
data matrix A. The problem of co-clustering consists in jointly grouping
the set of features and the set of instances into homogeneous blocks by
finding two assignment functions and that map as follows: ∶

{ , … , } → {̂ , … , ̂ }, ∶ { , … , ̂ } where g and k denote the
1

1
1
̂
number of row and columns clusters, and discrete random variables and
̂
̂
̂
represent the partitions induced by X and Y , i.e., = () = and =

().

The main idea is to consider the problem of finding a coupling
between the rows and columns of a given data matrix A, represented by a

uniform sum of m and n Diracs defined over a set of points in ℝ and ℝ

respectively. Formally, these rows and columns empirical measures can be
defined as:

∶= ∑ / and ∶= ∑ /.

=1 =1
For a given cost matrix M where stands for the distance between
,
⋆
line and column , the authors obtain an optimal coupling ( , )

using Sinkhorn given by:
⋆
( , ) = diag() −/ diag().

The obtained solution can be seen as a joint probability distribution
implying that the scaling vectors u and v can be seen as estimated rows
335 | I S I W S C 2 0 1 9

343 344 345 346 347 348 349 350 351 352 353