Correspondence between symmetric NMF, k-means, biclustering and spectral clustering
August 31, 2022 42 min read
Nonnegative matrix factorization (NMF) is the most fashionable method of matrix decomposition among productive, but mathematically illiterate biology students, much more popular than PCA due to its perceived simplicity. However, if you start digging deeper, it turns out to have profound connections to a multitude of other techniques from linear algebra and matrix analysis. In this post I discuss those connections.
In this post I am following the general logic of this paper, including its numbering of sections. However, I will provide lots of my own remarks, as I spent several months working with the objects discussed here, and will “rehydrate” this very clear, but somewhat condensed 4.5-page paper into a much larger blog post.
In this post I will present 4 seemingly different algorithms: symmetric NMF, k-means, biclustering and spectral clustering via the normalized cuts method. I will establish the correspondence between all 4 methods and show that they are essentially variations of the same mathematics.
1. Introduction to NMF and k-means
NMF and recommender systems
Consider an $n \times p$ matrix $V = \begin{pmatrix} v_{1,1} && v_{1,2} && ... && v_{1,p} \\ ... && ... && ... && ... \\ v_{n,1} && v_{n,2} && ... && v_{n,p} \end{pmatrix}$.
Oftentimes you would want to approximate it by a low-rank matrix $\hat{V}$. E.g. suppose that “low-rank” means rank $k$:
$\hat{V} = \begin{pmatrix}w_{1,1} \\ w_{2,1} \\ ... \\ w_{n,1}\end{pmatrix} \cdot \begin{pmatrix}h_{1,1} && h_{1,2} && ... && h_{1,p}\end{pmatrix} + ... + \begin{pmatrix}w_{1,k} \\ w_{2,k} \\ ... \\ w_{n,k}\end{pmatrix} \cdot \begin{pmatrix}h_{k,1} && h_{k,2} && ... && h_{k,p}\end{pmatrix}$
Here I use outer product notation to represent $\hat{V}$ as a product of two rectangular matrices: $n \times k$ matrix $W$ and $k \times p$ matrix $H$:
$\hat{V} = W \cdot H = \begin{pmatrix} w_{1,1} && ... && w_{1,k} \\ w_{2,1} && ... && w_{2,k} \\ ... && ... && ... \\ w_{n,1} && ... && w_{n,k} \end{pmatrix} \cdot \begin{pmatrix} h_{1,1} && h_{1,2} && ... && h_{1,p} \\ ... && ... && ... && ... \\ h_{k,1} && h_{k,2} && ... && h_{k,p} \end{pmatrix}$
Since the Netflix challenge, this approach has been very popular in recommender systems. If the matrix $V$ represents $n$ users, who assigned ratings to $p$ movies, then the matrix $W$ characterizes each of the $n$ users in terms of $k$ psychological traits (e.g. how much the user likes humour, violence, sentimentality etc.) and the matrix $H$ characterizes each of the $p$ movies in terms of these $k$ scales: how humorous, violent, sentimental etc. they are.
Then an obvious application of NMF is data imputation: most of the user ratings are missing, hence, we need to predict user ratings for movies they haven’t seen and suggest the movies with the highest expected ratings.
How would we search for an NMF in practice? We would require the (squared) Frobenius norm of the difference between the NMF approximation and the original matrix $V$ to be as small as possible:

$\min \limits_{W,H \ge 0} \|V - WH\|_F^2 = \min \limits_{W,H \ge 0} \sum \limits_{i=1}^n \sum \limits_{j=1}^p (v_{i,j} - \langle {\bf w_i}, {\bf h_j} \rangle)^2$,

where ${\bf w_i}$ is the $i$-th row of $W$ and ${\bf h_j}$ is the $j$-th column of $H$.
Frobenius norm is just the square root of the sum of squares of the elements of a matrix (if we treated the matrix as a regular vector, it would have been its L2 norm). Thus it can be interpreted as the distance between matrices.
However, the Frobenius norm is a good measure of the quality of our approximation not only because of its intuitiveness. A family of theorems in matrix analysis (such as the Hoffman-Wielandt inequality) shows that if a matrix approximation converges to the true matrix in Frobenius norm, then its eigenvalues and eigenvectors converge as well.
Another important way of representing the Frobenius norm is through the trace of the Gram matrix, created out of the initial matrix: $\|V - WH\|_F = \sqrt{Tr((V - WH)^T (V-WH))}$. We will heavily rely on this representation in this post.
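This identity is easy to verify numerically; here is a quick sanity check (mine, not from the paper) on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))  # plays the role of the residual V - WH

# Frobenius norm computed directly as the square root of the sum of squared elements
direct = np.sqrt((A ** 2).sum())

# ...and through the trace of the Gram matrix A^T A
via_trace = np.sqrt(np.trace(A.T @ A))

assert np.isclose(direct, via_trace)
assert np.isclose(direct, np.linalg.norm(A, 'fro'))  # numpy's built-in agrees as well
```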
An alternative option for the measure of approximation quality is the Kullback-Leibler divergence $D(V \| W H)$, which in the matrix case evaluates to $D(A \| B) = \sum \limits_{i,j} (A_{i,j} \log \frac{A_{i,j}}{B_{i,j}} - A_{i,j} + B_{i,j})$. In this case NMF bears similarities to various Bayesian methods, such as the Probabilistic Latent Semantic Analysis (PLSA) method. However, I won’t delve into the Kullback-Leibler-minimizing approach and will focus on the Frobenius norm as the approximation quality metric.
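For the sake of completeness, the matrix KL divergence above is a one-liner in numpy (a minimal sketch of my own; the function name kl_divergence is mine):

```python
import numpy as np

def kl_divergence(A: np.ndarray, B: np.ndarray) -> float:
    """Generalized Kullback-Leibler divergence D(A || B) between two
    strictly positive matrices of the same shape."""
    return float(np.sum(A * np.log(A / B) - A + B))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
assert np.isclose(kl_divergence(A, A), 0.0)  # D(A || A) = 0
assert kl_divergence(A, A + 0.5) > 0         # strictly positive when B differs from A
```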
NMF solver
One of the reasons for NMF's popularity is its simple iterative solver.
In a series of alternating steps it updates the values of the $W$ and $H$ matrices using the following multiplicative update rules (the fraction bar denotes elementwise division, and the multiplication of the update factor by $W$ or $H$ is elementwise as well):
$W \leftarrow W \cdot \frac{(VH^T)}{(WHH^T)}$
$H \leftarrow H \cdot \frac{(W^TV)}{(W^TWH)}$
Let us derive this formula.
We want to solve the following optimization problem:
$\min \limits_{W \ge 0, H \ge 0} \|V - WH\|_F$
Or, equivalently, expressing Frobenius norm through trace notation, we have:
$W^*, H^* = \arg \min \limits_{W \ge 0, H \ge 0} \sqrt{Tr((V  WH)^T(V  WH))} = \arg \min \limits_{W \ge 0, H \ge 0} Tr((V  WH)^T(V  WH))$
This is a minimization problem which can be solved through taking a derivative of a scalar function with respect to a matrix argument (the concept and mechanics of derivative of a scalar function with respect to a matrix argument is explained in detail here and here).
We will be doing gradient descent, iteratively decreasing the Frobenius norm with respect to the matrix $W$ (with $H$ fixed) on every odd step, and with respect to $H$ (with $W$ fixed) on every even step:
$W \leftarrow W - \eta_W \cdot \nabla_W f(W,H)$
$H \leftarrow H - \eta_H \cdot \nabla_H f(W,H)$
Here $\eta_W$ and $\eta_H$ are the gradient step sizes.
Now let us decompose our trace into 4 terms and take its derivatives $\nabla_W Tr((V - WH)^T(V - WH))$ and $\nabla_H Tr((V - WH)^T(V - WH))$ with respect to the matrices $W$ and $H$:
$Tr((V^T - H^TW^T)(V-WH) ) = Tr( V^TV - V^TWH - H^TW^TV + H^TW^TWH) =$
$= Tr( V^TV) - Tr(V^TWH) - Tr(H^TW^TV) + Tr(H^TW^TWH)$
Taking derivative of each term separately:
$\nabla_W Tr( V^TV) = 0$
$\nabla_W Tr(V^TWH) = \nabla_W Tr(HV^T W) = (H V^T)^T = VH^T$ (using the fact that $\nabla_W Tr(AW) = A^T$)
$\nabla_W Tr(H^TW^TV) = \nabla_W Tr(W^T V H^T) = VH^T$ (using the fact that $\nabla_W Tr(W^TA) = A$)
$\nabla_W Tr(H^TW^TWH) = \nabla_W Tr(W HH^T W^T) = 2 W HH^T$ (using the fact that $\nabla_W Tr(W A W^T) = W (A + A^T)$, here with the symmetric $A = HH^T$)
Aggregating the results:
$\nabla_W Tr((V - WH)^T(V - WH)) = 0 - VH^T - VH^T + 2 W HH^T = 2 (W HH^T - VH^T) = 2 (WH - V) H^T$
And carrying out similar calculations for $H$ we get:
$\nabla_H Tr((V - WH)^T(V - WH)) = -2 W^T V + 2 W^TW H = 2 W^T (WH - V)$
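Before plugging these gradients in, we can double-check the closed-form expressions against finite differences (a quick numerical check of my own):

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((5, 4))
W = rng.random((5, 2))
H = rng.random((2, 4))

def f(W, H):
    """The energy function Tr((V - WH)^T (V - WH))."""
    return np.trace((V - W @ H).T @ (V - W @ H))

grad_W = 2 * (W @ H - V) @ H.T  # closed-form gradient w.r.t. W
grad_H = 2 * W.T @ (W @ H - V)  # closed-form gradient w.r.t. H

# central finite differences, entry by entry
eps = 1e-6
num_grad_W = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        num_grad_W[i, j] = (f(W + E, H) - f(W - E, H)) / (2 * eps)

num_grad_H = np.zeros_like(H)
for i in range(H.shape[0]):
    for j in range(H.shape[1]):
        E = np.zeros_like(H)
        E[i, j] = eps
        num_grad_H[i, j] = (f(W, H + E) - f(W, H - E)) / (2 * eps)

assert np.allclose(grad_W, num_grad_W, atol=1e-5)
assert np.allclose(grad_H, num_grad_H, atol=1e-5)
```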
Substituting this back into the gradient descent update formulas (ignoring the factor 2, as it can be absorbed into the gradient step):
$W \leftarrow W - \eta_W \cdot (WH - V) H^T$
$H \leftarrow H - \eta_H \cdot W^T (WH - V)$
Now the authors of the original paper suggest setting specific (elementwise) values of the gradient steps:
$\eta_W = \frac{W}{WHH^T}$
$\eta_H = \frac{H}{W^TWH}$
This leads us to the declared algorithm:
$W \leftarrow W - \frac{W}{W H H^T} (WH - V)H^T = \frac{W}{WHH^T} \cdot (\cancel{WHH^T} - \cancel{WHH^T} + VH^T) = W \cdot \frac{VH^T}{WHH^T}$
$H \leftarrow H - \frac{H}{W^T W H} W^T (WH - V) = H \cdot \frac{W^TV}{W^TWH}$
Here is a python implementation:
```python
from typing import Tuple

import numpy as np
import numpy.typing as npt


class ConvergenceException(Exception):
    def __init__(self):
        Exception.__init__(self, "Iterations limit exceeded; algorithm failed to converge!")


def nmf(V: npt.NDArray, k: int, tol: float = 1e-4, max_iter: int = 100) -> Tuple[npt.NDArray, npt.NDArray]:
    """An NMF solver implementation which approximates an n x p input matrix V
    as a product W x H of n x k matrix W and k x p matrix H. This solver
    minimizes Frobenius norm of the difference between V and W H and implements
    an algorithm, described by Lee and Seung (2000) in:
    https://proceedings.neurips.cc/paper/2000/file/f9d1152547c0bde01830b7e8bd60024c-Paper.pdf

    :param V: n x p matrix to be approximated with NMF low-rank approximation
    :param k: rank of low-rank approximation
    :param tol: threshold value of decrease in energy function between two consecutive iterations for the algorithm to
      be considered converged
    :param max_iter: maximum number of iterations allowed; if reached before convergence, ConvergenceException is thrown
    :return: 2-tuple of two matrices:
      - W - n x k matrix
      - H - k x p matrix
    """
    n = V.shape[0]
    p = V.shape[1]
    previous_energy = np.inf
    next_energy = np.inf
    iterations = 0
    W = np.ones((n, k))
    H = np.ones((k, p))
    while (previous_energy - next_energy > tol) or next_energy == np.inf:
        W_update = np.dot(V, H.T) / np.dot(W, np.dot(H, H.T))
        W = W * W_update
        H_update = np.dot(W.T, V) / np.dot(W.T, np.dot(W, H))
        H = H * H_update
        if iterations > max_iter:
            raise ConvergenceException
        else:
            iterations += 1
        previous_energy = next_energy
        next_energy = np.linalg.norm(V - np.dot(W, H), 'fro')  # calculate Frobenius norm
    return W, H
```
If we run this code, the approximation is very decent:
```python
import unittest

import numpy as np

from ..biclustering_lasso import nmf


class NMFTestCase(unittest.TestCase):
    """
    python3 -m unittest biclustering_lasso.tests.test_nmf.NMFTestCase
    """
    def test_nmf(self):
        V = np.array([[2.1, 0.4, 1.2, 0.3, 1.1],
                      [2.1, 0.7, 2.3, 0.4, 2.2],
                      [2.4, 0.5, 3.2, 0.7, 3.3]])
        W, H = nmf(V, k=2)
        print(f"W = {W}")
        print(f"H = {H}")
```

```python
>>> W
array([[0.49893966, 0.49893966],
       [0.76064226, 0.76064226],
       [1.02271335, 1.02271335]])
>>> H
array([[1.36102025, 0.33184111, 1.50013542, 0.31221326, 1.49381373],
       [1.36102025, 0.33184111, 1.50013542, 0.31221326, 1.49381373]])
>>> np.dot(W, H)
array([[1.35813395, 0.33113738, 1.49695411, 0.31155116, 1.49064583],
       [2.07049903, 0.50482475, 2.2821328 , 0.47496521, 2.27251571],
       [2.78386716, 0.67875667, 3.06841706, 0.63860935, 3.05548651]])
>>> V
array([[2.1, 0.4, 1.2, 0.3, 1.1],
       [2.1, 0.7, 2.3, 0.4, 2.2],
       [2.4, 0.5, 3.2, 0.7, 3.3]])
```
As you can see, np.dot(W, H) approximates V very well. And due to the fact that it is so easy to implement, NMF is probably the favourite algorithm of your favourite bioinformatics student.
k-means as a special case of NMF
A less obvious application of NMF is data clustering. It turns out that k-means clustering is a special case of NMF.
Suppose that we have identified $k$ clusters $C_k$ among our data points $\bf v_i$ (the rows of $V$).
We can interpret the rows ${\bf h_k}$ of the matrix $H$ as (scaled) centroids of our clusters: ${\bf h_k} \propto \sum \limits_{i \in C_k }{\bf v_i}$
Then the matrix $W$ can be made orthogonal and nonnegative, representing the attribution of data points to clusters. E.g.:
$\hat{V} = W \cdot H = \begin{pmatrix} 1 && 0 \\ 0 && \frac{1}{\sqrt{2}} \\ 0 && \frac{1}{\sqrt{2}} \end{pmatrix} \cdot \begin{pmatrix} h_{1,1} && h_{1,2} && ... && h_{1,p} \\ h_{2,1} && h_{2,2} && ... && h_{2,p} \end{pmatrix}$
Here the matrix $W$ describes two clusters: the first contains the data point {$\bf v_1$} and the second contains {$\bf v_2$, $\bf v_3$}. If all the coordinates of the data are nonnegative, the coordinates of all the cluster centroids $\bf h_i$ are nonnegative as well. $W$ is nonnegative, too. So, all the requirements of NMF are satisfied.
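To make this correspondence tangible, here is a tiny numerical example of my own: build $W$ from cluster assignments (using the plain 0/1 indicator instead of the normalized orthogonal one, so that no rescaling of $H$ is needed) and $H$ from cluster means, and the product $W \cdot H$ replaces every data point with its cluster centroid:

```python
import numpy as np

V = np.array([[2.0, 0.0],
              [0.1, 1.9],
              [0.0, 2.1]])

# cluster 1 = {v1}, cluster 2 = {v2, v3}; plain 0/1 indicator matrix
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

# rows of H are cluster centroids (means of the assigned points)
H = np.array([V[0],
              (V[1] + V[2]) / 2])

V_hat = W @ H  # replaces every point with the centroid of its cluster
assert np.allclose(V_hat[0], V[0])
assert np.allclose(V_hat[1], [0.05, 2.0])
assert (W >= 0).all() and (H >= 0).all()  # both factors are nonnegative, as NMF requires
```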
k-means solution with an EM-like algorithm
In practice, k-means can be solved with a two-step iterative EM-like algorithm.
Initialize cluster centroids with random values (obviously, we can get smarter with initialization, but even random will do for now).
Each iteration of the EM algorithm consists of two steps:
- E-step: assign each data point to the cluster with the nearest centroid
- M-step: recalculate the coordinates of each centroid as the center of mass of the data points which belong to its cluster
This algorithm converges because there exists a nonnegative, monotonically non-increasing “energy” function, which decreases at each iteration, both on the E-step and the M-step.
The exact choice of the energy function can vary, and depending on the one selected, the algorithm takes on new interesting interpretations. For instance, NMF with Kullback-Leibler divergence as the energy function results in an interpretation of NMF as the probabilistic latent semantic analysis (PLSA) algorithm. We are going to use the Frobenius norm as the energy function, which results in a multitude of truncated PCA/biclustering/spectral clustering interpretations.
Here is an implementation of k-means (with k-means++ initialization):
```python
import math
import random
from typing import Literal, Tuple

import numpy as np
import numpy.typing as npt


class ConvergenceException(Exception):
    def __init__(self):
        Exception.__init__(self, "Iterations limit exceeded; algorithm failed to converge!")


def k_means_clustering(
    X: npt.NDArray,
    k: int,
    tol: float = 1e-5,
    max_iter: int = 100,
    energy: Literal['KL', 'fro'] = 'fro'
) -> Tuple[npt.NDArray, npt.NDArray]:
    """A minimal implementation of k-means clustering algorithm.

    :param X: n x p data matrix, where each point of data is a p-dimensional np.array
    :param k: number of cluster centroids, we aim to find
    :param tol: tolerance in energy; stop and return result, if the decrease in energy between 2 steps is below tol
    :param max_iter: maximum number of iterations of the algorithm allowed; abort with ConvergenceException if exceeded
    :param energy: monotonically non-increasing energy function to calculate (options: Kullback-Leibler, Frobenius norm)
    :return: (centroids, clusters) - a 2-tuple of 2 matrices:
      - centroids - k x p matrix of cluster centroids
      - clusters - n x k indicator vectors, which define a set of data points, which belongs to each cluster
    """
    n = X.shape[0]
    p = X.shape[1]
    centroids = init_centroids(X, k)
    clusters = np.empty(shape=(n, k))
    iterations = 0
    next_energy = np.inf
    previous_energy = np.inf
    while not (previous_energy - next_energy < tol):
        clusters = e_step(X, centroids)
        centroids = m_step(X, centroids, clusters)
        if iterations > max_iter:
            raise ConvergenceException
        else:
            iterations += 1
        previous_energy = next_energy
        next_energy = calculate_energy(X, centroids, clusters, energy)
    return centroids, clusters


def init_centroids(X, k) -> npt.NDArray:
    """Initialization procedure for setting the initial locations of
    centroids, based on k-means++ (2006-2007) algorithm:
    https://en.wikipedia.org/wiki/K-means%2B%2B.
    """
    n = X.shape[0]
    p = X.shape[1]
    centroids = np.zeros((1, p))
    random_point_index = random.randrange(n)
    np.copyto(centroids[0], X[random_point_index])  # use a random row of X matrix as a centroid
    for _ in range(k - 1):
        # find closest centroid to each data point
        clusters = e_step(X, centroids)

        # construct probability distribution of selecting data points as centroids
        distribution = np.zeros(n)
        for index, data_point in enumerate(X):
            # find the coordinate of 1 in clusters[index] - it will be the index of centroid
            nearest_centroid_index = np.argmax(clusters[index])  # finds the location of 1
            nearest_centroid = centroids[nearest_centroid_index]
            # probability of a point being selected as a new centroid is ~ square distance D(point, centroid)
            distribution[index] = np.dot(data_point - nearest_centroid, data_point - nearest_centroid)

        # pick a data point to be the next centroid at random
        new_centroid = random.choices(X, weights=distribution)
        centroids = np.vstack([centroids, new_centroid])
    return centroids


def e_step(X, centroids):
    """Assign each data point to a cluster with the nearest centroid."""
    clusters = np.zeros(shape=(X.shape[0], centroids.shape[0]))
    for data_point_index, data_point in enumerate(X):
        nearest_centroid_index = 0
        shortest_distance_to_centroid = np.inf
        for index, centroid in enumerate(centroids):
            direction = data_point - centroid
            distance = math.sqrt(np.dot(direction, direction))
            if distance < shortest_distance_to_centroid:
                shortest_distance_to_centroid = distance
                nearest_centroid_index = index
        clusters[data_point_index][nearest_centroid_index] = 1
    return clusters


def m_step(X, centroids, clusters):
    """Recalculate new centroids based on the updated clusters."""
    # divide each cluster indicator by the number of elements in it for averaging (e.g. [0, 1, 1] -> [0, 0.5, 0.5]);
    # note the .copy(): without it we would overwrite the 0/1 indicator matrix in place through a view
    normalized_clusters = clusters.T.copy()
    for index, cluster in enumerate(normalized_clusters):
        np.copyto(normalized_clusters[index], cluster / cluster.sum())
    normalized_clusters = normalized_clusters.T

    # move each centroid to the center of mass of its respective cluster
    new_centroids = np.dot(X.T, normalized_clusters).T
    return new_centroids


def calculate_energy(X, centroids, clusters, energy: Literal['KL', 'fro']) -> float:
    """Implementation of several energy functions calculation."""
    if energy == 'fro':
        result = np.linalg.norm(X - np.dot(clusters, centroids), 'fro')
    elif energy == 'KL':
        result = 0  # TODO
    else:
        raise ValueError(f"Undefined energy function type '{energy}'")
    return result
```
Let us test our implementation:
```python
import unittest

import numpy as np

from ..biclustering_lasso import k_means_clustering


class KMeansTestCase(unittest.TestCase):
    """
    python3 -m unittest biclustering_lasso.tests.test_kmeans.KMeansTestCase
    """
    def test_k_means(self):
        X = np.array([[2.1, 0.4, 1.2, 0.3, 1.1],
                      [2.1, 0.7, 2.3, 0.4, 2.2],
                      [2.4, 0.5, 3.2, 0.7, 3.3]])
        centroids, clusters = k_means_clustering(X, k=2)
        print(f"centroids = {centroids}")
        print(f"clusters = {clusters}")
```

```python
>>> centroids
array([[2.1 , 0.4 , 1.2 , 0.3 , 1.1 ],
       [2.25, 0.6 , 2.75, 0.55, 2.75]])
>>> clusters
array([[1., 0.],
       [0., 1.],
       [0., 1.]])
>>> np.dot(clusters, centroids)
array([[2.1 , 0.4 , 1.2 , 0.3 , 1.1 ],
       [2.25, 0.6 , 2.75, 0.55, 2.75],
       [2.25, 0.6 , 2.75, 0.55, 2.75]])
>>> X
array([[2.1, 0.4, 1.2, 0.3, 1.1],
       [2.1, 0.7, 2.3, 0.4, 2.2],
       [2.4, 0.5, 3.2, 0.7, 3.3]])
```
Again, we get a decent low-rank approximation of the matrix X by the k-means reconstruction np.dot(clusters, centroids).
Symmetric NMF
NMF is not unique. For instance, you could obviously insert an invertible matrix and its inverse between $W$ and $H$:
$V = WH = W B B^{-1} H = \hat{W} \hat{H}$.
Hence, there are plenty of options for additional constraints on $W$ and $H$.
You could demand that one of them, $W$ or $H$, is orthogonal. This special case leads to the k-means interpretation (where the orthogonal matrix represents clusters, and the other one contains the coordinates of centroids).
Or you can demand that NMF is symmetric and $W = H^T$. This special case is called symmetric NMF and will be considered further.
2. Symmetric NMF interpretation through k-means
It turns out that the symmetric NMF algorithm can be interpreted as a variation of “soft” k-means clustering.
In “hard” k-means clustering each data point belongs to strictly one cluster. In “soft” clustering every point is assigned weights, which show how close (or how likely to belong) it is to each cluster.
In “hard” clustering the clusters matrix is orthogonal. As we’ll see further in this post, in “soft” clustering the weights matrix is not exactly orthogonal, but tends to be approximately orthogonal.
In order to establish the correspondence between symmetric NMF and k-means I will first need a technical lemma, which shows that the minimum of the Frobenius norm of the error of a rank-$1$ matrix approximation coincides with the maximum of a bilinear form of the approximated matrix. Moreover, in the case of a rank-$k$ matrix approximation $\hat{V} = W \cdot H$, the minimum of the Frobenius norm of the error $\|V - \hat{V}\|_F$ corresponds to the maximum of the trace $Tr(W^T V H^T)$.
This fact will prove useful for all the 4 algorithms described in this post.
Lemma 2.0. Minimum of Frobenius norm corresponds to the maximum of Rayleigh quotient
Assume that matrices $W$ and $H^T$ are orthogonal, i.e.
$\langle {\bf w_i}, {\bf w_j} \rangle = \begin{cases}0, i \ne j \\ 1, i = j \end{cases}$
$\langle {\bf h_i}, {\bf h_j} \rangle = \begin{cases}0, i \ne j \\ 1, i = j \end{cases}$
Then the following holds:
$W^*, H^* = \arg \min \limits_{W, H} \|V - WH\|_F = \arg \max \limits_{W, H} Tr (W^T V H^T)$,
or, in the case of rank $k=1$ and single unit vectors $\bf w$ and $\bf h$ we get:
${\bf w}^*, {\bf h}^* = \arg \min \limits_{ {\bf w}, {\bf h} } \|V - {\bf w} {\bf h}^T\|_F = \arg \max \limits_{ {\bf w}, {\bf h} } {\bf w}^T V {\bf h}$,
Let us show this result for the rank $k=1$ case (a pair of vectors) and then generalize to arbitrary rank.
$\|V - {\bf w}{\bf h}^T\|_F^2 = Tr ((V - {\bf w}{\bf h}^T)^T (V - {\bf w}{\bf h}^T)) = \underbrace{Tr (V^T V)}_\text{const} - \underbrace{Tr(V^T {\bf w}{\bf h}^T)}_{Tr({\bf h}^T V^T {\bf w}) = {\bf w}^T V {\bf h} } - \underbrace{Tr({\bf h} {\bf w}^T V )}_{{\bf w}^T V {\bf h}} + \underbrace{ Tr({\bf h} {\bf w}^T {\bf w} {\bf h}^T) }_{= {\bf w}^T{\bf w} \cdot Tr({\bf h}{\bf h}^T) = \|{\bf w}\|_2^2 \cdot \|{\bf h}\|_2^2 = 1}$.
Hence, $\|V - {\bf w}{\bf h}^T\|_F^2$ attains its minimum when $-2 \cdot {\bf w}^T V {\bf h}$ attains its minimum, i.e. when ${\bf w}^T V {\bf h}$ attains its maximum.
Now consider $k=2$ case ($k=3, 4, ...$ etc. works similarly):
$\|V - WH\|_F^2 = Tr ((V - WH)^T (V - WH)) = \underbrace{Tr (V^T V)}_\text{const} - Tr (V^T WH) - Tr(H^T W^T V) + \underbrace{Tr(H^T W^T W H)}_{const} = -2 Tr(W^T V H^T) + const$.
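The rank-1 case of the lemma is easy to verify numerically: for unit vectors $\bf w$ and $\bf h$, the squared Frobenius error differs from $-2 \, {\bf w}^T V {\bf h}$ only by a constant (a quick check of my own):

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.random((5, 4))

def unit(v):
    return v / np.linalg.norm(v)

# the constant part: Tr(V^T V) + ||w||^2 ||h||^2 = Tr(V^T V) + 1 for unit vectors
const = np.trace(V.T @ V) + 1.0

for _ in range(100):
    w = unit(rng.standard_normal(5))
    h = unit(rng.standard_normal(4))
    err = np.linalg.norm(V - np.outer(w, h), 'fro') ** 2
    # minimizing the error is the same as maximizing the bilinear form w^T V h
    assert np.isclose(err, const - 2 * w @ V @ h)
```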
Lemma 2.1. Symmetric NMF is equivalent to kernel k-means clustering
Consider a special case of NMF with $W = H^T$, called symmetric NMF. In this case we’re looking for an approximation:
$V \approx H^T H$
And the goal is to find $H^* = \arg \min \limits_{H: H \ge 0} \|V - H^T H\|_F$.
If we assume that $H$ is orthogonal ($H H^T = I$, where $I$ is the identity matrix), then by Lemma 2.0 we immediately see that:
$H^* = \arg \min \limits_{H: H \ge 0, H H^T = I} \|V - H^T H\|_F = \arg \max \limits_{H: H \ge 0, H H^T = I} Tr(H V H^T)$.
Lemma 2.2. Symmetric NMF matrix is near orthogonal
Interestingly, we might not need to explicitly impose the requirement that $H H^T = I$: $H$ still turns out to be approximately orthogonal, if we find it through symmetric NMF.
$H^* = \arg \min \limits_{H} \|V - H^T H\|_F$
$\|V - H^T H\|_F^2 = Tr ((V - H^T H)^T(V - H^T H)) = \underbrace{Tr(V^T V)}_\text{const} - Tr((H^T H)^T V) - Tr(V^T (H^T H)) + Tr(H^T H H^T H) = const - 2 Tr((H^T H)^T V) + Tr(H^T H H^T H) = const - 2 Tr(HVH^T) + \|H H^T\|_F^2$
Here the term $Tr(HVH^T)$ corresponds to the k-means-like optimization problem we have just seen. The term $\|H H^T\|_F^2$ works almost like a constraint. Let’s focus on it:
$H H^T = \begin{pmatrix} h_{1,1} && h_{1,2} && h_{1,3} && h_{1,4} \\ h_{2,1} && h_{2,2} && h_{2,3} && h_{2,4} \end{pmatrix} \cdot \begin{pmatrix} h_{1,1} && h_{2,1} \\ h_{1,2} && h_{2,2} \\ h_{1,3} && h_{2,3} \\ h_{1,4} && h_{2,4} \end{pmatrix} = \begin{pmatrix} \langle {\bf h_1}, {\bf h_1} \rangle && \langle {\bf h_1}, {\bf h_2} \rangle \\ \langle {\bf h_2}, {\bf h_1} \rangle && \langle {\bf h_2}, {\bf h_2} \rangle \end{pmatrix}$,
where e.g. ${\bf h_1} = \begin{pmatrix} h_{1,1} \\ h_{1,2} \\ h_{1,3} \\ h_{1,4} \end{pmatrix}$.
From the definition of the Frobenius norm, $\|H H^T\|_F^2 = \langle {\bf h_1}, {\bf h_1} \rangle^2 + \langle {\bf h_1}, {\bf h_2} \rangle^2 + \langle {\bf h_2}, {\bf h_1} \rangle^2 + \langle {\bf h_2}, {\bf h_2} \rangle^2 = \sum \limits_{k} \sum \limits_{l} \langle {\bf h_k}, {\bf h_l} \rangle^2$, where the sums run over the rows of $H$.
Let us split this sum into two parts, treating diagonal and off-diagonal elements separately:
$\|H H^T\|_F^2 = \sum \limits_{l \ne k} \langle {\bf h_l}, {\bf h_k} \rangle^2 + \sum \limits_{l=k} \langle {\bf h_k}, {\bf h_k} \rangle^2$
Minimization of the term $\sum \limits_{l \ne k} \langle {\bf h_l}, {\bf h_k} \rangle^2$ essentially enforces approximate orthogonality, while the second term is responsible for normalization: $\sum \limits_{l=k} \langle {\bf h_k}, {\bf h_k} \rangle^2 = \sum \limits_{k} \|{\bf h_k}\|_2^4$. This term attains its minimum when the norms of the vectors are approximately balanced ($\|{\bf h_1}\|_2 \approx \|{\bf h_2}\|_2 \approx ... \approx \|{\bf h_k}\|_2$).
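This split of $\|H H^T\|_F^2$ into off-diagonal and diagonal parts can be confirmed with a couple of lines of numpy (my own sanity check):

```python
import numpy as np

rng = np.random.default_rng(3)
H = rng.random((2, 4))  # k = 2 rows, like in the example above

G = H @ H.T  # Gram matrix of the rows of H
total = np.linalg.norm(G, 'fro') ** 2

off_diagonal = (G ** 2).sum() - (np.diag(G) ** 2).sum()  # near-orthogonality penalty
diagonal = (np.diag(G) ** 2).sum()                       # normalization term

assert np.isclose(total, off_diagonal + diagonal)

# the diagonal part is exactly the sum of fourth powers of the row norms
row_norms = np.linalg.norm(H, axis=1)
assert np.isclose(diagonal, (row_norms ** 4).sum())
```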
3. Biclustering problem and quadratic programming/NMF
Another related problem is the problem of biclustering.
Suppose that you have a matrix with expressions of $p$ genes, measured in $n$ patients, and you want to find subgroups of patients with similar patterns of expression (e.g. you’re looking for subtypes of cancer).
So, you want to simultaneously detect subsets of columns and subsets of rows, such that they explain e.g. a large chunk of the variance in your data (while the variance within a bicluster is expected to be low).
Equivalently, this problem can be reformulated as detection of dense subgraphs in a bipartite graph.
Lemma 3.1. Biclustering problem is equivalent to quadratic programming/NMF low-rank approximation of a Jordan-Wielandt matrix
Let us formalize the biclustering problem.
We need to find a pair of indicator vectors:
${\bf x}: x_i = \begin{cases} 1, i \in A \\ 0, i \notin A \end{cases}$, indexing the $n$ rows of $V$, where $A$ is the selected subset of rows,
${\bf y}: y_j = \begin{cases} 1, j \in B \\ 0, j \notin B \end{cases}$, indexing the $p$ columns of $V$, where $B$ is the selected subset of columns,
such that ${\bf x}^T V {\bf y} \to \max / \min$
I call this bilinear programming, as we need to optimize the value of a bilinear form. In a more general case we might need to find matrices ($X$, $Y$), i.e. multiple pairs ($\bf x_i$, $\bf y_i$), potentially covering the whole matrix; the vector pairs may or may not intersect, depending on the problem statement.
Turns out, we can reformulate this bilinear optimization as a quadratic optimization problem. Let us put the vectors $\bf x$ and $\bf y$ on top of each other, concatenating them into a unified vector ${\bf h} = (x_1, x_2, ..., x_n, y_1, y_2, ..., y_p)^T$.
Replace the $V$ matrix with the following Jordan-Wielandt matrix:
$B = \begin{pmatrix} 0 && V \\ V^T && 0 \end{pmatrix}$
Then our optimization problem is to find $\frac{1}{2}{\bf h}^T B {\bf h} \to \max / \min$.
Again, if we want to find multiple biclusters, instead of a single vector $\bf h$ we use a whole matrix $H$, and then our optimization problem takes a familiar form: $\max \limits_{H} \frac{1}{2} Tr(H^T B H)$.
This problem already looks awfully familiar, right? Let us consider one more face of it.
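Before moving on, the equivalence between the bilinear form of $V$ and the quadratic form of the Jordan-Wielandt matrix $B$ can be checked numerically (my own toy example):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 4, 3
V = rng.random((n, p))

# Jordan-Wielandt matrix B = [[0, V], [V^T, 0]]
B = np.block([[np.zeros((n, n)), V],
              [V.T, np.zeros((p, p))]])

x = np.array([1.0, 0.0, 1.0, 0.0])  # indicator over the rows of V
y = np.array([0.0, 1.0, 1.0])       # indicator over the columns of V
h = np.concatenate([x, y])          # the stacked vector h = (x, y)

# the bilinear form x^T V y equals half the quadratic form h^T B h
assert np.isclose(x @ V @ y, 0.5 * h @ B @ h)
```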
4. k-means corresponds to spectral clustering
Another interesting interpretation of the same algorithm comes from spectral graph theory.
To get a taste of this field, first suppose that we have a graph that consists of multiple disjoint connected components.
It can be shown that the indicator vector of each connected component of this graph is an eigenvector of the so-called graph Laplacian, with zero eigenvalue.
For example, consider this disjoint graph, its degree matrix $D$, adjacency matrix $A$ and Laplacian $L = D - A$:
$D = \begin{pmatrix} 2 && 0 && 0 && 0 && 0 \\ 0 && 2 && 0 && 0 && 0 \\ 0 && 0 && 2 && 0 && 0 \\ 0 && 0 && 0 && 1 && 0 \\ 0 && 0 && 0 && 0 && 1 \end{pmatrix}$, $A = \begin{pmatrix} 0 && 1 && 1 && 0 && 0 \\ 1 && 0 && 1 && 0 && 0 \\ 1 && 1 && 0 && 0 && 0 \\ 0 && 0 && 0 && 0 && 1 \\ 0 && 0 && 0 && 1 && 0 \end{pmatrix}$, $L = \begin{pmatrix} 2 && -1 && -1 && 0 && 0 \\ -1 && 2 && -1 && 0 && 0 \\ -1 && -1 && 2 && 0 && 0 \\ 0 && 0 && 0 && 1 && -1 \\ 0 && 0 && 0 && -1 && 1 \end{pmatrix}$
It is easy to see that the vectors $\begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \end{pmatrix}$ and $\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}$, which define the two connected components, are eigenvectors of the graph Laplacian with 0 eigenvalue:
$\begin{pmatrix} 2 && -1 && -1 && 0 && 0 \\ -1 && 2 && -1 && 0 && 0 \\ -1 && -1 && 2 && 0 && 0 \\ 0 && 0 && 0 && 1 && -1 \\ 0 && 0 && 0 && -1 && 1 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \end{pmatrix} = 0 \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \end{pmatrix}$
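Reproducing this calculation with numpy (a small check of mine):

```python
import numpy as np

# adjacency matrix of the disjoint graph: a triangle {1, 2, 3} and an edge {4, 5}
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))  # degree matrix
L = D - A                   # graph Laplacian

h1 = np.array([1.0, 1, 1, 0, 0])  # indicator of the first connected component
h2 = np.array([0.0, 0, 0, 1, 1])  # indicator of the second connected component

assert np.allclose(L @ h1, 0)  # eigenvector with eigenvalue 0
assert np.allclose(L @ h2, 0)  # same for the other component
```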
Now, instead of a strictly disjoint graph, we might have a loosely disjoint one, with relatively dense subgraphs interconnected by just a few “bridge” edges. We might want to find those dense subgraphs, and they would serve as good clusters as well, if we reformulated the problem as clustering.
Lemma 4.0. MinMax Cut solution through optimization of a quadratic form with Laplacian matrix
In 1993 Wu and Leahy suggested an algorithm for finding a minimal/maximal cut in graphs, based on spectral clustering.
The cut is the set of edges that would split the graph into 2 halves, $A$ and $B$, so that the sum of the weights of the edges which connect these two parts is minimal: $cut(A, B) = \sum \limits_{u \in A, v \in B} w(u, v)$.
Speaking in terms of Laplacians and spectral graph theory, we can reformulate the MinCut optimization problem as follows. We introduce indicator vectors ${\bf h} = (1, 1, ..., 1, -1, -1, ..., -1)$, where $h_i = 1$ if vertex $i \in A$ and $h_i = -1$ if $i \in B$.
Now, if we define some partition of our graph $V$ into halves $A$ and $B$, the adjacency matrix gets split into 4 blocks: the diagonal blocks of the matrix represent edges within the $A$ and $B$ subgraphs, while the off-diagonal blocks represent $cut(A, B)$.
Now consider a quadratic form ${\bf h}^T A {\bf h}$:
${\bf h}^T A {\bf h} = \begin{pmatrix} 1 && 1 && 1 && -1 && -1 \end{pmatrix} \cdot \begin{pmatrix} w_{1,1} && w_{1,2} && w_{1,3} && w_{1,4} && w_{1,5} \\ w_{2,1} && w_{2,2} && w_{2,3} && w_{2,4} && w_{2,5} \\ w_{3,1} && w_{3,2} && w_{3,3} && w_{3,4} && w_{3,5} \\ w_{4,1} && w_{4,2} && w_{4,3} && w_{4,4} && w_{4,5} \\ w_{5,1} && w_{5,2} && w_{5,3} && w_{5,4} && w_{5,5} \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{pmatrix} =$
$= \sum \limits_{i,j \in A} w_{i,j} + \sum \limits_{i,j \in B} w_{i,j} - \sum \limits_{i \in A, j \in B} w_{i,j} - \sum \limits_{i \in B, j \in A} w_{i,j} = \sum \limits_{i,j \in A} w_{i,j} + \sum \limits_{i,j \in B} w_{i,j} - 2 \cdot \sum \limits_{(i, j) \in cut(A, B)} w_{i,j}$.
Now, take the degree matrix $D$ and create a similar quadratic form out of it:
${\bf h}^T D {\bf h} = \begin{pmatrix} 1 && 1 && 1 && -1 && -1 \end{pmatrix} \cdot \begin{pmatrix} \sum \limits_j{w_{1,j}} && ... && 0 \\ 0 && \sum \limits_j{w_{i,j}} && 0 \\ 0 && ... && \sum \limits_j{w_{|V|,j}} \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{pmatrix}$.
If we now subtract the two quadratic forms, we’ll see that all edges, except the members of $cut(A, B)$, cancel each other out:
${\bf h}^T D {\bf h} - {\bf h}^T A {\bf h} = {\bf h}^T (D-A) {\bf h} = {\bf h}^T L {\bf h} = 4 \cdot \sum \limits_{(i, j) \in cut(A, B)} w_{i,j}$.
In other words, to minimize $cut(A, B)$, we need to find an indicator vector ${\bf h}$ that minimizes the quadratic form ${\bf h}^T L {\bf h}$, where $L$ is the graph Laplacian. We have just discovered another interesting connection with linear algebra and spectral graph theory, which is going to be instrumental a bit later.
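We can verify the identity ${\bf h}^T L {\bf h} = 4 \cdot cut(A, B)$ numerically; in the example below (my own) I take the disjoint graph from above and add one “bridge” edge between vertices 3 and 4, so that the minimal cut has weight 1:

```python
import numpy as np

# the disjoint graph from above, plus one "bridge" edge between vertices 3 and 4
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A

h = np.array([1.0, 1, 1, -1, -1])  # A = {1, 2, 3}, B = {4, 5}
cut_weight = 1.0                   # only the bridge edge crosses the partition

assert np.isclose(h @ L @ h, 4 * cut_weight)
```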
For now, I shall notice that although the MinMax Cut algorithm was enlightening, in its raw form it was prone to the fallacy of selecting a single, mostly disconnected vertex as $A$ and cutting it off of the rest of the graph.
Lemma 4.1. Normalized Cut problems on a graph correspond to the k-means clustering problem
To address this undesired behaviour of MinMax Cut, Shi and Malik (1997, 2000) came up with their (now famous) Normalized Cut algorithm. They tweaked the optimized function so that the cut removes a sizable portion of the edges of the graph relative to the total weight of the edges kept within its halves:
$Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}$, where $assoc(A, V) = \sum \limits_{u \in A, t \in V} w(u, t)$.
If we take a look at the weighted adjacency matrix representing the graph, and cluster the subgraphs $A$ and $B$ into diagonal blocks, we’d see that $cut(A, B)$ corresponds to the sum of the weights of the off-diagonal blocks, while the denominators $assoc(A, V)$ and $assoc(B, V)$ correspond to full-width horizontal (or full-height vertical) blocks.
Shi and Malik use a slightly different version of indicator vectors. Unlike MinMax Cut, their indicator vectors $\bf h$ are comprised of 0/1 coordinates: $h_i = \begin{cases} 1, i \in A \\ 0, i \in B \end{cases}$.
Then $Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)} = \frac{{\bf h}^T(D-A){\bf h}}{{\bf h}^T D {\bf h}} + \frac{{\bf (1-h)}^T(D-A){\bf (1-h)}}{{\bf (1-h)}^T D {\bf (1-h)}}$.
Now, if we take a look at the denominators, ${\bf h}^T D {\bf h} = {\bf 1}^T D {\bf 1} - {\bf (1-h)}^T D {\bf (1-h)}$.
Shi and Malik then define a constant $k$, such that ${\bf h}^T D {\bf h} = k \cdot {\bf 1}^T D {\bf 1}$ and ${\bf (1-h)}^T D {\bf (1-h)} = (1-k) \cdot {\bf 1}^T D {\bf 1}$.
Then $Ncut(A, B) = \frac{{\bf h}^T(D-A){\bf h}}{{\bf h}^T D {\bf h}} + \frac{{\bf (1-h)}^T(D-A){\bf (1-h)}}{{\bf (1-h)}^T D {\bf (1-h)}} = \frac{{\bf h}^T(D-A){\bf h}}{k \cdot {\bf 1}^T D {\bf 1}} + \frac{{\bf (1-h)}^T(D-A){\bf (1-h)}}{(1-k) \cdot {\bf 1}^T D {\bf 1}} = \frac{ (1-k) \cdot {\bf h}^T(D-A){\bf h} + k \cdot {\bf (1-h)}^T(D-A){\bf (1-h)} }{k \cdot (1-k) \cdot {\bf 1}^T D {\bf 1}}$.
After that, Shi and Malik define the ratio $b = \frac{k}{1-k}$. Then they perform a long series of simplifications and show that:
$Ncut(A, B) = \frac{ (1-k) \cdot {\bf h}^T(D-A){\bf h} + k \cdot {\bf (1-h)}^T(D-A){\bf (1-h)} }{k \cdot (1-k) \cdot {\bf 1}^T D {\bf 1}} = \frac{({\bf h} - b \cdot ({\bf 1-h}))^T (D-A) ({\bf h} - b \cdot ({\bf 1-h})) }{b \cdot {\bf 1}^T D {\bf 1}}$.
Finally, Shi and Malik define a new variable ${\bf y} = {\bf h} - b ({\bf 1 - h})$ and apply two changes to the resulting formula:
- They notice that the denominator can be altered: ${\bf y}^T D {\bf y} = \sum \limits_{i \in A}d_i + b^2 \sum \limits_{i \in B}d_i = b \sum \limits_{i \in B}d_i + b^2 \sum \limits_{i \in B}d_i = b (\sum \limits_{i \in B}d_i + b \sum \limits_{i \in B}d_i) = b \, {\bf 1}^T D {\bf 1}$
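This identity for the denominator can also be verified numerically (a quick check of my own on a random weighted graph):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2       # symmetric weighted adjacency matrix
np.fill_diagonal(W, 0)  # no self-loops
D = np.diag(W.sum(axis=1))

h = np.array([1.0, 1, 0, 1, 0, 0])  # 0/1 indicator of the subset A
one = np.ones(n)

k = (h @ D @ h) / (one @ D @ one)  # Shi and Malik's constant k
b = k / (1 - k)                    # ...and the ratio b
y = h - b * (one - h)              # the new variable y

assert np.isclose(y @ D @ y, b * (one @ D @ one))  # y^T D y = b * 1^T D 1
```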