
Conjugate gradients

September 21, 2023 · 27 min read

Conjugate gradients is one of the most popular and computationally efficient methods for solving systems of linear equations. Along with other Krylov subspace methods, it can also be used to find eigenvalues and eigenvectors of a matrix.

System of linear equations as a quadratic optimization problem

In case $A$ is a Gram matrix (i.e. symmetric and positive definite), so that its eigenvalues are real and positive, it is useful to think about the solution of a system of linear equations $Ax = b$ in terms of an equivalent quadratic problem: finding the minimum of a quadratic function, given by a quadratic form:

$x^* : A x - b = 0 \iff x^* = \arg \min \limits_x \phi(x), \quad \phi(x) = \frac{1}{2} x^T A x - b^T x$
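To see the equivalence, note that since $A$ is symmetric, the gradient of $\phi$ is

$\nabla \phi(x) = \frac{1}{2}(A + A^T) x - b = A x - b$

so $\nabla \phi(x^*) = 0$ exactly when $A x^* = b$, and positive definiteness of $A$ guarantees that this stationary point is the unique minimum.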

The quadratic form $x^T A x$ represents an ellipsoid: the eigenvectors $\{ e_i \}$ of the matrix $A$ are its axes, and the eigenvalues $\{ \lambda_i \}$ determine the lengths of its semi-axes. The difference between the optimal solution and the current point is $x^* - x$; the residual vector $r = Ax - b = Ax - Ax^*$ will be discussed later.

Naive approaches

How to solve a system of linear equations computationally?

We shall briefly consider two naive approaches, coordinate descent and steepest descent, before discussing conjugate gradients and their advantages.

Coordinate descent

First, let us look at the simplest approach: coordinate descent. I.e. we iteratively choose a coordinate $x_i$ of the vector $x$ and minimize the value of the function $\phi(x)$ in that coordinate by solving $\frac{\partial \phi(x)}{\partial x_i} = 0$.
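For concreteness, here is a minimal NumPy sketch of this scheme (the function name and the sweep count are my own illustrative choices). Solving $\frac{\partial \phi}{\partial x_i} = 0$ with the other coordinates fixed gives the familiar Gauss-Seidel update.

```python
import numpy as np

def coordinate_descent(A, b, x0, n_sweeps=50):
    """Cyclic coordinate descent on phi(x) = 1/2 x^T A x - b^T x.

    Minimizing over x_i with the other coordinates fixed gives
    x_i = (b_i - sum_{j != i} A_ij x_j) / A_ii  (the Gauss-Seidel update)."""
    x = x0.astype(float)
    n = len(b)
    for _ in range(n_sweeps):
        for i in range(n):
            x[i] = (b[i] - A[i, :] @ x + A[i, i] * x[i]) / A[i, i]
    return x
```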

Convergence

Coordinate descent clearly does not guarantee convergence in N steps:

coordinate descent

Coordinate descent. In this example we have a 2D vector $x$, and coordinate descent does not converge to the solution $x^*$ in 2 steps.

Steepest descent

A smarter approach is steepest descent.

At each step of steepest descent we calculate the gradient vector $\nabla \phi$ of the function $\phi(x)$ and use its negative as the direction of descent.

Here and below I will denote the direction vector for descent by $d$. If we are speaking of multiple steps, I will denote it $d_i$. In case of steepest descent, at each step $d_i = -\nabla \phi(x_i)$, where $x_i$ is the approximation of the solution $x^*$ obtained at the $i$-th step, starting from an initial approximation $x_0$.

If we are currently at the point $x$, we shall move to some point $x + \alpha d$, where $\alpha$ is the step. We know the direction, but how do we find the optimal step $\alpha$ of descent?

The length of the step is optimal when the function $\phi(x + \alpha d)$ attains its minimum over $\alpha$. It means that the directional derivative $\frac{\partial \phi}{\partial d} = 0$ and the isolevels of the function $\phi(x) = \frac{1}{2} x^T A x - b^T x$ become tangent to the direction $d$ of descent at the point $x + \alpha d$.

Line search

Line search. We start from the point $x$ (defined by a vector starting from the origin) and search in the direction $d$ for an optimal length $\alpha$ of the step along $d$, so that the directional derivative of the quadratic function $\phi(x)$ is $0$ there, i.e. the direction vector is tangent to the isolevels of the function $\phi(x)$ at that point.

Speaking in terms of multivariate calculus, $\langle d, \frac{\partial \phi}{\partial x} \rangle = 0$.

Simplifying this expression, we obtain the step length $\alpha$:

$d^T (A (x + \alpha d) - b) = 0$

$d^T A x + \alpha d^T A d - d^T b = 0$

$\alpha = - \frac{d^T (Ax - b)}{d^T A d} = - \frac{d^T r}{ d^T A d }$

Now our steepest descent algorithm is fully defined. By the way, this formula for the step length $\alpha$ is going to be useful below.
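Putting the pieces together, here is a minimal NumPy sketch of steepest descent with this exact line search (the names and the tolerance are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Steepest descent for SPD A: descend along d = -(A x - b) with the
    optimal step alpha = -d^T r / (d^T A d) derived above."""
    x = x0.astype(float)
    for _ in range(max_iter):
        r = A @ x - b                       # residual, equal to the gradient of phi
        if np.linalg.norm(r) < tol:
            break
        d = -r                              # steepest descent direction
        alpha = -(d @ r) / (d @ (A @ d))    # exact line search step
        x = x + alpha * d
    return x
```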

Convergence

The convergence rate of steepest descent varies. For instance, it might converge in one iteration, if $A$ is diagonal and all of its eigenvalues are equal:

Perfect convergence of steepest descent

Perfect convergence of steepest descent. If matrix $A$ is diagonal with equal eigenvalues, isolevels are spheres and steepest descent converges in 1 iteration.

However, in general we cannot guarantee fast convergence for steepest descent. E.g. if the matrix $A$ is ill-conditioned, i.e. the difference in magnitude between its largest and smallest eigenvalues is large, steepest descent will start jumping from one side of the ellipsoid to the other.

Poor convergence of steepest descent

Poor convergence of steepest descent. If we have an ill-conditioned matrix, so that even one eigenvalue of matrix $A$ is much smaller than the other eigenvalues, the ellipsoid isolevels look like a ravine, and the gradient jumps off the sides of the ravine. Image from J.R. Shewchuk's book.

Hence, the convergence rate is related to the condition number $\kappa$ of the matrix. One can actually derive a formula for the convergence rate:

$\frac{f(x_{(i)}) - f(x)}{f(x_{(0)}) - f(x)} = \frac{\frac{1}{2} e_{(i)}^T A e_{(i)} }{ \frac{1}{2} e_{(0)}^T A e_{(0)} } \le \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2i}$

The exact derivation of this fact is somewhat tedious (although not so hard to understand) and can be found in J.R. Shewchuk's tutorial, part 6.2.

Conjugate directions and conjugate gradients

As we’ve seen, neither coordinate descent nor steepest descent is guaranteed to converge in $N$ steps.

Here we discuss a better approach, conjugate directions, which will converge in $N$ steps or less. It has a special case, conjugate gradients, which is even more computationally efficient.

Let us start with an example of orthogonal axes and then transition to a general case.

Orthogonal axes

Consider the case when all the eigenvectors of our surface are parallel to the coordinate axes. In such a case the matrix $A$ is diagonal, and performing line search along the coordinate axes guarantees convergence in $N$ iterations.

eigenvectors parallel to axes

If matrix $A$ is diagonal, line search along the coordinate axes (which in this case are conjugate directions) converges in $N$ steps. Image based on J.R. Shewchuk.

However, in the general case we cannot rely on this. If the eigenvectors of the matrix $A$ are not parallel to the coordinate axes, this approach won’t work: line search along the coordinate axes would no longer converge in $N$ steps.

A-orthogonality

Instead of orthogonality, we introduce the notion of A-orthogonality:

$\langle a, b \rangle_A = 0 \iff \langle A a, b \rangle = 0 \iff \langle a, A b \rangle = 0$

We will apply two transformations of coordinates.

First, apply the eigendecomposition of $A$:

$A = E \Lambda E^{-1}$, where $E$ is the matrix of eigenvectors (each eigenvector is a column vector); $E$ is orthogonal, because the eigenvectors of a symmetric matrix are orthogonal; $\Lambda$ is the diagonal matrix of eigenvalues, where each eigenvalue $\lambda_i$ is known to be real and positive (for a Gram matrix $A = B^T B$ it equals the square $\sigma_i^2$ of the corresponding singular value of $B$).

$\langle A a, b \rangle = b^T A a = b^T E \Lambda E^{-1} a$

Use the fact that $E$ is orthogonal for a symmetric matrix $A$, hence $E^T = E^{-1}$. Denote $(E^T b)^T = b^T E = \beta^T$ and $E^T a = \alpha$.

What did we do? We changed the coordinates, so that the coordinate axes are now the eigenvectors $E$.

In these coordinates vector $a$ has coordinates $\alpha$ and vector $b$ has coordinates $\beta$.

Second, define $\Lambda^{1/2} = \begin{pmatrix} \sqrt{\lambda_1} & 0 & 0 \\ 0 & \sqrt{\lambda_2} & 0 \\ 0 & 0 & \sqrt{\lambda_3} \end{pmatrix}$.

$\langle A a, b \rangle = b^T A a = b^T E \Lambda E^{-1} a = \beta^T \Lambda \alpha = (\Lambda^{1/2} \beta)^T \cdot (\Lambda^{1/2} \alpha)$.

What did we do here? We stretched our coordinate axes by the square roots of the respective eigenvalues $\sqrt{\lambda_i}$, transforming our vector $\alpha$ into $\Lambda^{1/2} \alpha = \Lambda^{1/2} E^{-1}a$ and $\beta$ into $\Lambda^{1/2} \beta = \Lambda^{1/2} E^{-1}b$.

The eigenvectors, forming the semi-axes of the ellipsoids, take identical lengths upon this change of coordinates, so that our ellipsoids become spheres.

So, here is how A-orthogonality works: first we go from the original coordinates to coordinates where the axes are the eigenvectors $E$ of $A$. We apply this transform to both $a$ and $b$, obtaining their coordinates $E^{-1}a$ and $E^{-1}b$ in the eigenbasis.

Second, we come to a coordinate system where the isolevels are spheres. In these coordinates the resulting vectors $\Lambda^{1/2} E^{-1}a$ and $\Lambda^{1/2} E^{-1}b$ must be orthogonal.

A-orthogonality

A-orthogonality: vectors that are A-orthogonal become orthogonal if we make two changes of coordinates: first, we make the eigenvectors the axes of our coordinate system, obtaining image (a); second, we stretch the coordinate axes so that our isolevels become concentric spheres, resulting in image (b). Image from J.R. Shewchuk.
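A quick numerical illustration of this picture (a sketch with a made-up $2 \times 2$ matrix): two A-orthogonal vectors become plain-orthogonal after applying the map $\Lambda^{1/2} E^{-1}$.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                  # symmetric positive definite
lam, E = np.linalg.eigh(A)                  # A = E diag(lam) E^T

a = np.array([1.0, 0.0])
Aa = A @ a
b = np.array([-Aa[1], Aa[0]])               # perpendicular to A a, so <A a, b> = 0
print(Aa @ b)                               # ~0: a and b are A-orthogonal

S = np.diag(np.sqrt(lam)) @ E.T             # the map Lambda^{1/2} E^{-1} (E^T = E^{-1})
print((S @ a) @ (S @ b))                    # ~0: the transformed vectors are orthogonal
```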

Conjugate directions

Let us prove the guarantee of convergence of the conjugate directions algorithm in $N$ steps.

Consider a descent algorithm which uses vectors $\{ d_i \}$ as descent directions and step sizes $\alpha_i = - \frac{d_i^T r_i }{d_i^T A d_i}$, where $r_i = A x_i - b$ is the current residual.

$x_n = x_0 + \alpha_0 d_0 + \alpha_1 d_1 + ... + \alpha_{n-1} d_{n-1}$

We want to prove that if $\{ d_i \}$ are conjugate directions, then $x_n = x^*$.

Indeed, we know that the vectors $\{ d_i \}$ span the whole space. Hence, there exist coefficients $\{ \sigma_i \}$ such that we can reach the desired solution point $x^*$ by walking along the directions:

$x^* = x_0 + \sigma_0 d_0 + \sigma_1 d_1 + ... + \sigma_{n-1} d_{n-1}$

Pre-multiply this expression by $d_i^T A$:

$d_{i}^T A (x^* - x_0) = \cancel{\sigma_0 d_{i}^T A d_0} + ... + \sigma_i d_{i}^T A d_{i} + ... + \cancel{\sigma_{n-1} d_{i}^T A d_{n-1}}$

Hence, $\sigma_{i} = \frac{d_{i}^T A (x^* - x_0)}{ d_{i}^T A d_{i} }$.

Recall that $Ax^* = b$ (because $x^*$ is the exact solution of our system of linear equations).

Thus, $\sigma_{i} = \frac{d_{i}^T A (x^* - x_0)}{ d_{i}^T A d_{i} } = \frac{ d_{i}^T (b - A x_0) }{ d_{i}^T A d_{i} } = \frac{ -d_{i}^T r_0 }{ d_{i}^T A d_{i} }$.

Now, consider $x_i = x_0 + \alpha_0 d_0 + \alpha_1 d_1 + ... + \alpha_{i-1} d_{i-1}$ and pre-multiply it by $d_i^T A$:

$d_i^T A (x_i - x_0) = 0$ (by A-orthogonality of the directions), hence $d_i^T A x_i = d_i^T A x_0$ and therefore $d_i^T r_i = d_i^T (A x_i - b) = d_i^T (A x_0 - b) = d_i^T r_0$.

Substituting this into $\sigma_{i} = \frac{ -d_{i}^T r_0 }{ d_{i}^T A d_{i} }$ produces $\sigma_{i} = \frac{ -d_{i}^T r_i }{ d_{i}^T A d_{i} }$, which coincides with $\alpha_i = - \frac{d_i^T r_i }{d_i^T A d_i}$.

Hence, $\sigma_i = \alpha_i$, and after $n$ steps the algorithm arrives exactly at $x_n = x^*$.
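This proof is easy to check numerically. In the sketch below (a toy example with a random SPD matrix) we A-orthogonalize the standard basis with Gram-Schmidt in the $\langle \cdot, \cdot \rangle_A$ inner product and then walk the resulting directions with the step sizes $\alpha_i$ from above; after $n$ steps we land exactly on $x^* = A^{-1}b$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)                      # a random SPD matrix
b = rng.normal(size=n)

# build conjugate directions by A-orthogonalizing the standard basis (Gram-Schmidt)
dirs = []
for v in np.eye(n):
    d = v.copy()
    for p in dirs:
        d -= (p @ (A @ d)) / (p @ (A @ p)) * p   # remove the <.,.>_A projection onto p
    dirs.append(d)

# walk the directions with alpha_i = -d_i^T r_i / (d_i^T A d_i)
x = np.zeros(n)
for d in dirs:
    r = A @ x - b
    x += -(d @ r) / (d @ (A @ d)) * d
print(np.allclose(x, np.linalg.solve(A, b)))     # True: x_n = x*
```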

Conjugate directions as an energy norm minimization

An insightful way of thinking about conjugate gradients is through norm minimization/maximization.

One can show that conjugate directions is similar to PCA in terms of minimization/maximization of a certain norm.

PCA at each iteration finds a projection that minimizes Frobenius norm of approximation of our matrix by low-rank matrices:

$\| A - \sum \limits_{k=1}^{i} \lambda_k u_k v_k^T \|_F \to \min$

TODO: check PCA analogy

TODO: duality between energy norm and Frobenius norm (maximum of energy norm corresponds to minimum of Frobenius norm)

Similarly, we can introduce the notion of the energy norm of a vector $x$: $\| x \|_A = x^T A x$.

One may show that at each step the conjugate directions algorithm chooses such a vector $d_i$ that minimizes the energy norm of the error:

$\| e_i \|_A \to \min$

$\| e_i \|_A = e_i^T A e_i = (\sum \limits_{k=i}^{n-1} \alpha_k d_k)^T A (\sum \limits_{k=i}^{n-1} \alpha_k d_k) = \sum \limits_{k=i}^{n-1} \alpha_k^2 d_k^T A d_k$ (the cross terms vanish by A-orthogonality).

TODO: check notation and finish this part

Conjugate directions and eigenvectors

Let $A = E \Lambda E^T$ be the eigenvector decomposition of $A$.

Consider the definition of conjugate directions: $P^T A P = D$, where $D$ is some diagonal matrix and $P$ is the matrix whose columns are the conjugate directions.

Substituting the eigendecomposition:

$P^T E \Lambda E^T P = D$

$(\Lambda^{1/2} E^T P)^T (\Lambda^{1/2} E^T P) = D$

So the columns of $\Lambda^{1/2} E^T P$ are mutually orthogonal with lengths $\sqrt{d_i}$, i.e. $\Lambda^{1/2} E^T P = Q D^{1/2}$ for some orthogonal matrix $Q$, and

$P = E \Lambda^{-1/2} Q D^{1/2}$

In particular, choosing $Q = I$ and $D = \Lambda$ recovers $P = E$: the eigenvectors themselves are one possible set of conjugate directions, and every other set is obtained from them by this rescaling and rotation.
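A small numerical check of this relation (a sketch; the matrices $Q$ and $D$ below are arbitrary choices): for any orthogonal $Q$ and positive diagonal $D$, the columns of $P = E \Lambda^{-1/2} Q D^{1/2}$ are conjugate directions with $P^T A P = D$.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])                 # symmetric positive definite
lam, E = np.linalg.eigh(A)                      # A = E diag(lam) E^T

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # an arbitrary orthogonal matrix
D = np.diag([1.0, 2.0, 3.0])                    # an arbitrary positive diagonal matrix

P = E @ np.diag(lam ** -0.5) @ Q @ np.sqrt(D)
print(np.round(P.T @ A @ P, 8))                 # recovers D: the columns of P are conjugate
```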

Conjugate gradients

How do we obtain conjugate directions in practice?

A simple and straightforward approach: let us use negative gradients as directions: $d_{i+1} = - r_{i+1} = -(A x_{i+1} - b)$. The problem is, these “conjugate” directions are not actually conjugate: $d_i^T A d_{i+1} \ne 0$, the directions are not A-orthogonal (although, by the way, we know that $d_{i+1}$ is orthogonal, not A-orthogonal, to the previous search direction, due to tangency of the search direction to the isolevels at the point $x_{i+1}$).

conjugate_directions

Conjugate directions in 2D. Note that $r_{(1)}$ is orthogonal to $d_{(0)}$ and $e_{(1)}$ is A-orthogonal to $d_{(0)}$. Image from J.R. Shewchuk.

So, we correct the gradient direction by subtracting $d_i$ with some coefficient in order to A-orthogonalize $d_{i+1}$ and $d_i$.

We will first show that the residual $r_{i+1}$ is orthogonal to each of the previous directions $\{ d_0, ..., d_{i} \}$:

$x^* = x_0 + \alpha_0 d_0 + \alpha_1 d_1 + ... + \alpha_{n-1} d_{n-1}$

$x_{i+1} = x_0 + \alpha_0 d_0 + ... + \alpha_i d_i$

$r_{i+1} = Ax_{i+1} - b = Ax_{i+1} - Ax^* = A (x_0 + \alpha_0 d_0 + ... + \alpha_i d_i) - A (x_0 + \alpha_0 d_0 + \alpha_1 d_1 + ... + \alpha_{n-1} d_{n-1}) = - A (\alpha_{i+1}d_{i+1} + ... + \alpha_{n-1} d_{n-1})$

$d_k^T r_{i+1} = 0$ for $k \le i$, since $d_k$ is A-orthogonal to each of $d_{i+1}, ..., d_{n-1}$.

Hence, $r_{i+1}$ is already orthogonal to $\{ d_0, ..., d_{i} \}$.

But more importantly for conjugate gradients, it is not only orthogonal, but also A-orthogonal to $\{ d_0, ..., d_{i-1} \}$!

$x_{i+1} = x_{i} + \alpha_i d_{i}$

$r_{i+1} = r_{i} + \alpha_i A d_{i}$

$\alpha_i A d_i = r_{i+1} - r_{i}$

Pre-multiply this with $r^T_k$: $\alpha_i r^T_k A d_i = r^T_k r_{i+1} - r^T_k r_{i}$.

The following cases arise: $\begin{cases} k \notin \{i, i+1\}: & r^T_k A d_i = 0 \\ k = i: & \alpha_i r^T_i A d_i = - r^T_i r_{i} \\ k = i + 1: & \alpha_i r^T_{i+1} A d_i = r^T_{i+1} r_{i+1} \end{cases}$ (in the first case we use the fact that each residual is a combination of at most two directions, $r_k = -d_k + \beta d_{k-1}$, while every residual is orthogonal to all directions with smaller indices, so $r^T_k r_{i+1} = r^T_k r_i = 0$).

Hence, $r_{i+1}$ is already A-orthogonal to $d_k$ for every $k < i$ (apply the first case to the direction $d_k$ and the residual $r_{i+1}$). To A-orthogonalize $d_{i+1}$, derived from $r_{i+1}$, all we need to do is to orthogonalize it with respect to $d_i$. This greatly simplifies the calculation process and saves us a lot of memory:

$d_{i+1} = - r_{i+1} + \beta d_{i}$

The calculation of $\beta$ follows from the A-orthogonality of $d_{i+1}$ and $d_{i}$: pre-multiply this expression with $d^T_{i} A$:

$\cancel{d^T_{i} A d_{i+1}} = - d^T_{i} A r_{i+1} + \beta d^T_{i} A d_{i}$ (the left-hand side vanishes because we require $d_{i+1}$ to be A-orthogonal to $d_i$)

$d^T_{i} A r_{i+1} = \beta d^T_{i} A d_{i}$

$\beta = \frac{ d^T_{i} A r_{i+1} }{ d^T_{i} A d_{i} }$

This is just a special case of Gram-Schmidt orthogonalization.
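Collecting the updates above ($\alpha_i$, the residual recurrence and $\beta$) gives the conjugate gradients algorithm. Below is a minimal NumPy sketch that transcribes these formulas literally (the names, tolerance and the tiny test system are my own illustrative choices):

```python
import numpy as np

def conjugate_gradients(A, b, x0=None, tol=1e-10, max_iter=None):
    """Conjugate gradients for SPD A, following the formulas derived above:
    alpha_i = -d_i^T r_i / (d_i^T A d_i), r_{i+1} = r_i + alpha_i A d_i,
    beta = r_{i+1}^T A d_i / (d_i^T A d_i), d_{i+1} = -r_{i+1} + beta d_i."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = A @ x - b                       # residual r_0
    d = -r                              # first direction: negative gradient
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = -(d @ r) / (d @ Ad)     # optimal step along d
        x = x + alpha * d
        r = r + alpha * Ad              # update the residual
        beta = (r @ Ad) / (d @ Ad)      # A-orthogonalize the next direction against d
        d = -r + beta * d
    return x

# usage: solve a small SPD system and compare with the direct solution
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradients(A, b), np.linalg.solve(A, b))
```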

Krylov space

One more important aspect: assume $x_0 = 0$, so that $-r_0 = b - Ax_0 = b$ and the first direction is $d_0 = b$. Hence, $x_1 = \alpha_0 b$ is proportional to $b$.

Similarly, $x_2 = x_1 + \alpha_1 d_1$, where $d_1 = -r_1 + \beta d_0 = (b - A x_1) + \beta b$. Since $x_1$ is proportional to $b$, the direction $d_1$ is a linear combination of $b$ and $Ab$, and therefore $x_2$ lies in the span of $\{ b, Ab \}$.

Similarly, if we consider $x_3$, we get a linear combination of $A^2 b$, $A b$ and $b$. And so on.

Hence, in case of $x_i$ we are working with the space of vectors $[ A^{i-1} b, A^{i-2} b, ..., A b, b ]$.

This space is called Krylov space and is useful in related problems, such as numeric eigenvector/eigenvalue estimation. Such problems are often solved with a related family of algorithms, such as Arnoldi iteration and Lanczos algorithm.


Convergence rate: inexact arithmetic

If we allow for a certain error, conjugate gradients might not even require $N$ iterations. Consider the following clustering of the eigenvalues of our matrix $A$:

Eigenvalue clusters

Two clusters of eigenvalues. From Nocedal-Wright’s Numerical optimization.

Recall the convergence rate formula: $\| x_{k+1} - x^* \|_A \le \left( \frac{\lambda_{n-k} - \lambda_1}{ \lambda_{n-k} + \lambda_1 } \right)^2 \| x_0 - x^* \|_A$.

Suppose that we’ve already made $m+1$ iterations, where $m$ is the number of eigenvalues in the large cluster. Then, since the remaining eigenvalues are clustered together, $\lambda_{n-m} - \lambda_1$ is very small and $\lambda_{n-m} + \lambda_1 \approx 2 \lambda_1$.

Hence, $\| x_{m+1} - x^* \|_A \le \epsilon$, where $\epsilon > 0$ is small, and we’ve basically converged to the solution.

Preconditioning and clusters of eigenvalues

If the matrix $A$ is ill-conditioned, iterative multiplication of a vector $b$ by the matrix $A$ accumulates floating point rounding errors, resulting in a failure to converge to the correct solution. Hence, we need to “normalize” the matrix $A$, similarly to how we apply batch norm or layer norm to normalize the layers of neural networks.

This process is called pre-conditioning. We are multiplying both sides of our equation by a pre-conditioner matrix $P$: $P A x = P b$.

An ideal pre-conditioner would convert our ellipsoid into a perfect sphere: its axes would then have to be the eigenvectors and its scaling coefficients the eigenvalues. So the perfect pre-conditioner is the matrix inverse. However, this is computationally impractical, as computing it is more or less equivalent to solving the initial system $Ax = b$ and would take $O(N^3)$ operations.

The simplest pre-conditioner matrix is a diagonal matrix made of the values on the diagonal of the matrix $A$. This is called the Jacobi pre-conditioner. It is equivalent to scaling the matrix along the coordinate axes by the values on the diagonal of $A$. Obviously, this is an imperfect solution.
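As a sketch of how this is used in practice (assuming SciPy is available; the tridiagonal test matrix below is made up for illustration), one can pass the Jacobi pre-conditioner to SciPy's conjugate gradients solver as a linear operator that divides by the diagonal of $A$:

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

n = 100
# an ill-conditioned SPD test matrix: widely spread diagonal plus weak coupling
A = np.diag(np.linspace(1.0, 1000.0, n)) + 0.1 * np.eye(n, k=1) + 0.1 * np.eye(n, k=-1)
b = np.ones(n)

inv_diag = 1.0 / np.diag(A)
M = LinearOperator((n, n), matvec=lambda v: inv_diag * v)   # Jacobi pre-conditioner

x, info = cg(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))   # info == 0 means the solver reports convergence
```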

A more practical pre-conditioner would be an incomplete Cholesky decomposition. I will first discuss a regular Cholesky decomposition in the context of conjugate directions, and then its sparse variation.

Cholesky decomposition as a special case of conjugate directions

Cholesky decomposition is a representation of our matrix $A$ as a product of a lower-triangular matrix $L$ and its transpose, the upper-triangular matrix $L^T$: $A = L L^T$.

One can see that the columns of the (transposed) inverse of the Cholesky factor are actually conjugate directions of the matrix $A$, such that the first direction is parallel to the first coordinate axis, the second direction lies in the plane of the first two axes, etc. Consider an upper-triangular matrix $V$ such that:

$V^T A V = I$

$V^T A = V^{-1}$

$A = {(V^{-1})}^T V^{-1}$

Now we can denote $L = {(V^{-1})}^T$ and $L^T = V^{-1}$, and we get $A = L L^T$.

To see that the inverse of an upper-triangular matrix is upper-triangular, just apply the textbook method of matrix inversion: write down the matrix and the identity matrix next to each other, Gauss-eliminate the left one, and you will obtain the inverse on the right:

$\begin{pmatrix} a^1_1 & a^1_2 & \cdots & a^1_{n-1} & a^1_n & 1 & 0 & \cdots & 0 & 0 \\ & a^2_2 & \cdots & a^2_{n-1} & a^2_n & & 1 & \cdots & 0 & 0 \\ & & \ddots & \vdots & \vdots & & & \ddots & \vdots & \vdots \\ & & & a^{n-1}_{n-1} & a^{n-1}_n & & & & 1 & 0 \\ & & & & a^n_n & & & & & 1 \end{pmatrix}$
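A quick numerical check of this claim (a sketch with a random SPD matrix): the columns of $V = (L^{-1})^T$ are indeed conjugate directions of $A$ with $V^T A V = I$, and $V$ is upper-triangular.

```python
import numpy as np

rng = np.random.default_rng(1)
B = np.tril(rng.normal(size=(4, 4))) + 4 * np.eye(4)
A = B @ B.T                              # a random SPD matrix
L = np.linalg.cholesky(A)                # A = L L^T, L lower-triangular
V = np.linalg.inv(L).T
print(np.round(V.T @ A @ V, 8))          # the identity matrix
print(np.round(V, 3))                    # upper-triangular
```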

Cholesky decomposition algorithm

The algorithm for Cholesky decomposition is actually pretty straightforward.

$\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} = \begin{pmatrix} l_{1,1} & 0 & 0 \\ l_{2,1} & l_{2,2} & 0 \\ l_{3,1} & l_{3,2} & l_{3,3} \end{pmatrix} \cdot \begin{pmatrix} l_{1,1} & l_{2,1} & l_{3,1} \\ 0 & l_{2,2} & l_{3,2} \\ 0 & 0 & l_{3,3} \end{pmatrix}$

$\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} = \begin{pmatrix} l_{1,1} & 0 & 0 \\ ? & ? & 0 \\ ? & ? & ? \end{pmatrix} \cdot \begin{pmatrix} l_{1,1} & ? & ? \\ 0 & ? & ? \\ 0 & 0 & ? \end{pmatrix} \implies l^2_{1,1} = a_{1,1}$

$\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} = \begin{pmatrix} l_{1,1} & 0 & 0 \\ ? & ? & 0 \\ ? & ? & ? \end{pmatrix} \cdot \begin{pmatrix} ? & l_{2,1} & ? \\ 0 & ? & ? \\ 0 & 0 & ? \end{pmatrix} \implies l_{2,1} = a_{1,2} / l_{1,1}$

$\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} = \begin{pmatrix} l_{1,1} & 0 & 0 \\ ? & ? & 0 \\ ? & ? & ? \end{pmatrix} \cdot \begin{pmatrix} ? & ? & l_{3,1} \\ 0 & ? & ? \\ 0 & 0 & ? \end{pmatrix} \implies l_{3,1} = a_{1,3} / l_{1,1}$

$\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} = \begin{pmatrix} ? & 0 & 0 \\ l_{2,1} & l_{2,2} & 0 \\ ? & ? & ? \end{pmatrix} \cdot \begin{pmatrix} ? & l_{2,1} & ? \\ 0 & l_{2,2} & ? \\ 0 & 0 & ? \end{pmatrix} \implies l_{2,2}^2 = a_{2,2} - l_{2,1}^2$

$\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} = \begin{pmatrix} ? & 0 & 0 \\ ? & ? & 0 \\ l_{3,1} & ? & ? \end{pmatrix} \cdot \begin{pmatrix} l_{1,1} & ? & ? \\ 0 & ? & ? \\ 0 & 0 & ? \end{pmatrix} \implies l_{3,1} = a_{3,1} / l_{1,1}$

$\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} = \begin{pmatrix} ? & 0 & 0 \\ l_{2,1} & l_{2,2} & 0 \\ ? & ? & ? \end{pmatrix} \cdot \begin{pmatrix} ? & ? & l_{3,1} \\ 0 & ? & l_{3,2} \\ 0 & 0 & l_{3,3} \end{pmatrix} \implies l_{2,1} l_{3,1} + l_{2,2} l_{3,2} = a_{2,3} \implies l_{3,2} = \frac{ a_{2,3} - l_{2,1} l_{3,1} } { l_{2,2} }$

$\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} = \begin{pmatrix} ? & 0 & 0 \\ ? & ? & 0 \\ l_{3,1} & l_{3,2} & l_{3,3} \end{pmatrix} \cdot \begin{pmatrix} ? & ? & l_{3,1} \\ 0 & ? & l_{3,2} \\ 0 & 0 & l_{3,3} \end{pmatrix} \implies l_{3,3}^2 = a_{3,3} - l_{3,1}^2 - l_{3,2}^2$
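The walkthrough above generalizes to a short column-by-column procedure; here is a minimal NumPy sketch of it (not a production implementation, and the test matrix is made up):

```python
import numpy as np

def cholesky(A):
    """Dense Cholesky decomposition A = L L^T, computed column by column:
    l_jj = sqrt(a_jj - sum_k l_jk^2),  l_ij = (a_ij - sum_k l_ik l_jk) / l_jj."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - np.sum(L[j, :j] ** 2))
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.array([[4.0, 2.0, 1.0], [2.0, 3.0, 0.5], [1.0, 0.5, 2.0]])
L = cholesky(A)
print(np.allclose(L @ L.T, A))   # True
```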

Incomplete Cholesky decomposition

Full Cholesky decomposition is expensive and takes $O(N^3)$ operations.

In the case of sparse matrices (or if we introduce sparsification ourselves), we can achieve a similar result with much less compute.

The simplest form of incomplete Cholesky decomposition repeats the sparsity pattern of the original matrix. E.g. if the original matrix is:

$A = \begin{pmatrix} 1 & 0 & 0 & 2 \\ 0 & 3 & 0 & 4 \\ 0 & 0 & 5 & 0 \\ 2 & 4 & 0 & 6 \end{pmatrix}$

its incomplete Cholesky decomposition would have a sparsity pattern of:

$L = \begin{pmatrix} ? & 0 & 0 & 0 \\ 0 & ? & 0 & 0 \\ 0 & 0 & ? & 0 \\ ? & ? & 0 & ? \end{pmatrix}$

Zeros above the main diagonal stem from the fact that $L$ is lower-triangular; zeros below the main diagonal reproduce the sparsity pattern of the matrix $A$:

$A \approx L L^T = \begin{pmatrix} ? & 0 & 0 & 0 \\ 0 & ? & 0 & 0 \\ 0 & 0 & ? & 0 \\ ? & ? & 0 & ? \end{pmatrix} \cdot \begin{pmatrix} ? & 0 & 0 & ? \\ 0 & ? & 0 & ? \\ 0 & 0 & ? & 0 \\ 0 & 0 & 0 & ? \end{pmatrix}$

This approach is known as $IC(0)$. If you want more precision and less sparsity, you can take the sparsity pattern of $A^2$, which has fewer zero elements, and compute the incomplete Cholesky decomposition on that pattern; this is called $IC(1)$, and so on for $IC(k)$. A greater $k$ results in better pre-conditioning at the cost of more compute.
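Here is a minimal sketch of $IC(0)$ in dense storage (an assumed illustration; real implementations work on sparse formats): it is the usual Cholesky recurrence, except that entries outside the sparsity pattern of $A$ are simply never filled in.

```python
import numpy as np

def ichol0(A):
    """IC(0): a lower-triangular L with the sparsity pattern of A such that L L^T ~ A."""
    n = A.shape[0]
    L = np.tril(A).astype(float)            # start from the lower triangle of A
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        for i in range(k + 1, n):
            if L[i, k] != 0.0:
                L[i, k] /= L[k, k]
        for j in range(k + 1, n):
            for i in range(j, n):
                if L[i, j] != 0.0:          # update only positions inside the pattern
                    L[i, j] -= L[i, k] * L[j, k]
    return L

# a sparse SPD test matrix where exact Cholesky would produce fill-in
A = np.array([[4.0, 1.0, 1.0, 0.0],
              [1.0, 4.0, 0.0, 1.0],
              [1.0, 0.0, 4.0, 1.0],
              [0.0, 1.0, 1.0, 4.0]])
L = ichol0(A)
print(np.round(L @ L.T - A, 3))             # nonzero only where fill-in was dropped
```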

If the matrix $A$ is not sparse at all, we can employ the $IC(\tau)$ approach: choose some threshold $\tau$ on the absolute value of an element of the matrix $A$, and if an element is smaller than that, treat it as 0. Thus, we obtain a sparse matrix with a Frobenius norm more or less close to that of the original matrix $A$. Hence, we get a reasonable pre-conditioner at an $O(N^2)$ compute and $O(N)$ memory cost.



Boris Burkov

Written by Boris Burkov, who lives in Moscow, Russia, loves to take part in the development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me on Telegram.