
Intro to compressed sensing

June 21, 2022

For almost a century the field of signal processing existed in the paradigm of the Nyquist-Shannon theorem (also known as the Kotelnikov theorem in the Russian literature), which claimed that you cannot extract more than n Fourier harmonics from a signal if you have n measurements. However, thanks to results in functional analysis from the 1980s, such as the Johnson-Lindenstrauss lemma and the Kashin-Garnaev-Gluskin inequality, it became evident that you can get away with as few as ~log(n) measurements! After application of the L1-norm machinery developed in the 1990s, such as LASSO and LARS, a new groundbreaking theory of compressed sensing emerged in the mid-2000s to early 2010s. In this post I'll briefly cover some of its ideas and results.

Motivation: Nyquist-Shannon (Kotelnikov) theorem

Suppose that you have a WAV file. It basically contains sound pressures, measured 44 100 or 48 000 times a second.

Why 44 100/48 000?

Suppose that you want to recover harmonics of frequencies 1 Hz, 2 Hz, 3 Hz, … up to 20 kHz (more or less the upper threshold of what a human ear can hear). Each harmonic of your sound of frequency $\omega_i$ is described by 2 variables: amplitude $A_i$ and phase $\phi_i$.

So, if you are trying to recover those 20 000 amplitude/phase pairs, you need at least 40 000 sound pressure measurements $p_i$, taken at moments of time $t_i$, when you apply the Fourier transform to recover them. Here it is:

$$\begin{cases}A_1 e^{ -\frac{2 \pi j \cdot \omega_1 \cdot (t_1 + \phi_1)}{40000} } + A_2 e^{-\frac{2 \pi j \cdot \omega_2 \cdot (t_1 + \phi_2)}{40000}} + ... + A_{20000} e^{-\frac{2 \pi j \cdot \omega_{20000} \cdot (t_1 + \phi_{20000})}{40000}} = p_1 \\ A_1 e^{ -\frac{2 \pi j \cdot \omega_1 \cdot (t_2 + \phi_1)}{40000} } + A_2 e^{-\frac{2 \pi j \cdot \omega_2 \cdot (t_2 + \phi_2)}{40000}} + ... + A_{20000} e^{-\frac{2 \pi j \cdot \omega_{20000} \cdot (t_2 + \phi_{20000})}{40000}} = p_2 \\ ... \\ A_1 e^{ -\frac{2 \pi j \cdot \omega_1 \cdot (t_{40000} + \phi_1)}{40000} } + A_2 e^{-\frac{2 \pi j \cdot \omega_2 \cdot (t_{40000} + \phi_2)}{40000}} + ... + A_{20000} e^{-\frac{2 \pi j \cdot \omega_{20000} \cdot (t_{40000} + \phi_{20000})}{40000}} = p_{40000} \end{cases}$$

Basically, what the Nyquist-Shannon theorem states is that there is no way to recover 40 000 variables without at least 40 000 linear equations.

Or is it?

Compressed sensing

In 1995 Chen and Donoho realised that most real-life signals are sparse in some basis: e.g. if you have a recording of music, most of its harmonics have zero amplitude, and only a very small subset of them are non-zero.

The problem of recovery of $n$ non-zero harmonics out of $N$ is generally NP-hard (this problem can be posed as a constrained optimization of the L0 norm).

In the 1980s it became clear, however, that you could relax this L0-constrained problem to a convex problem with an L1 constraint, which is still able to recover the small subset of non-zero coefficients, but does so in low polynomial time.

Chen and Donoho applied this approach to photo compression problems with the wavelet transform and obtained spectacular results, in some cases even achieving lossless compression.

wavelet_transform.png

This approach was popularized 1 year later in the field of statistics by Tibshirani and Hastie, and is now known as Lasso regression or regression with L1 regularization.

The compressed sensing problem can be formulated as follows: you are given a vector of $n$ measurements ${\bf y} \in \mathbb{R}^n$ (e.g. $n=100$). Those measurements were obtained by applying a sensing matrix $A$ to a high-dimensional vector ${\bf x} \in \mathbb{R}^N$ in an $N$-dimensional space, where e.g. $N=40000$.

The trick is that you know that only a small subset of the coordinates of ${\bf x}$ is non-zero. The matrix $A$ is $n$-by-$N$, i.e. very wide and very thin. Now, using the Lasso magic you can try to recover your true vector ${\bf x}$ by approximating it with a vector ${\bf \bar{x}}$, the solution of the following convex optimization problem:

$$\begin{cases} \min \limits_{\bar{\bf x}} ||\bar{\bf x}||_1 \\ {\bf y} = A \bar{\bf x} \end{cases}$$
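
Before we go further, here is a minimal sketch of this recovery in code (my own illustration, not from the original post): a random Gaussian sensing matrix and sklearn's Lasso with a tiny regularization weight standing in for the equality-constrained L1 minimization.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

N, n, S = 2000, 100, 5              # ambient dimension, number of measurements, sparsity

x_true = np.zeros(N)                # S-sparse vector we want to recover
support = rng.choice(N, size=S, replace=False)
x_true[support] = rng.normal(0, 1, S)

A = rng.normal(0, 1, (n, N)) / np.sqrt(n)   # random Gaussian sensing matrix
y = A @ x_true                              # n measurements, n << N

# a small alpha makes Lasso approximate the equality-constrained L1 minimization above
lasso = Lasso(alpha=1e-4, fit_intercept=False, max_iter=100000)
lasso.fit(A, y)
x_hat = lasso.coef_

print("recovered support:", sorted(np.flatnonzero(np.abs(x_hat) > 1e-2)))
print("true support:     ", sorted(support))
print("max abs error:", np.abs(x_hat - x_true).max())   # typically small: roughly the Lasso shrinkage bias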

What's left unclear is how to construct such a sensing matrix, and in which basis our signal is expected to be sparse.

Here comes the method of random projections. In an older post I wrote about an incredibly beautiful and unbelievably powerful statement, called the Johnson-Lindenstrauss lemma, which basically states that if you project any point cloud of a finite number of points (e.g. ~1000) of arbitrarily large dimensionality $N$ with e.g. a Gaussian random projection matrix onto a space of small dimensionality $n$ (in the ballpark of ~100-200 dimensions), all the distances between every pair of points will be preserved up to a tiny error, with probability approaching 1.

Turns out, we can use this Gaussian random projection basis (or another basis satisfying certain criteria) as a sensing basis for our signal, and we should be able to recover the full signal of e.g. 40 000 harmonics from just a few hundred measurements, if that signal is sparse.

Theorem 1 (exact recovery of sparse signal in noiseless case is bounded by mutual coherence)

In reality our measurement devices are never perfect and generate some level of noise; later on I will show that even noisy measurements allow for compressed sensing.

But first let us consider a “spherical horse in vacuum”, the perfect noiseless measurements case.

We understand now that we can do with $n$ measurements, which is much, much less than $N$, but how big should $n$ be, precisely?

Turns out that $n \ge \text{Const} \cdot \mu^2(A) \cdot S \cdot \log N$ is enough.

Here:

  • $n$ is the number of measurements required
  • $N$ is the size of our ambient space
  • $S$ is the maximum number of non-zero coordinates in the vector ${\bf x}$ we strive to recover (this number is called sparsity)
  • $A$ is our sensing matrix
  • $\mu(A) = \max \limits_{k, j} |A_{k, j}|$ is a bit tricky: it is a quantity called coherence (the smaller it is, the more incoherent the sensing matrix)

I’ll explain incoherence below.

Signal incoherence

To understand why signal incoherence is important let us consider a perfectly bad sensing matrix: let each row of our matrix be a Dirac’s delta, i.e. every measurement measures just a single coordinate of our data vector. How many measurements will it take to recover all the vector coordinates more or less reliably?

This is the coupon collector's problem: on average $N \cdot \log N$ measurements will be required to measure every single one of the $N$ coordinates of our data vector.
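
A direct simulation illustrates the $N \log N$ growth (this is my own toy check, not part of the original argument):

import numpy as np

rng = np.random.default_rng(0)
N = 1000

def n_measurements_to_see_all(N):
    # keep drawing uniformly random coordinates until every one has been hit at least once
    seen = np.zeros(N, dtype=bool)
    count = 0
    while not seen.all():
        seen[rng.integers(N)] = True
        count += 1
    return count

trials = [n_measurements_to_see_all(N) for _ in range(20)]
print("empirical average:", np.mean(trials))   # ~ N * (ln N + 0.577) for N = 1000
print("N log N          :", N * np.log(N))     # same order of growth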

In this case our sensing matrix is called perfectly coherent: all the “mass” of our measurement is concentrated in a single coordinate. Incoherent measurement matrices, on the other hand, are those that have their mass spread evenly. Note that we impose a condition that the $l_2$ norm of each row of the sensing matrix is normalized to $\sqrt{N}$.

Examples of incoherent and coherent sensing matrices

  • Dirac’s delta - poor sensing matrix, coherent, see coupon collector problem
  • Bernoulli random matrix - an incoherent sensing matrix
  • Gaussian random matrix - another incoherent sensing matrix
  • (possibly incomplete) Fourier Transform matrix - even better sensing matrix
  • random convolution - good sensing matrix (incoherence can be seen in Fourier domain)

See lecture by Emmanuel Candes.

Sensing matrix as a product of measurement matrix and sparsity inducing matrix

In practice the sensing matrix $A$ is somewhat constrained.

First, its rows are supposed to have an L2 norm of $\sqrt{N}$. With this constraint $\mu(A)$ varies in the range from 1 to $\sqrt{N}$. If the “mass” is evenly spread across the entries of $A$, the matrix is incoherent and is great for compressed sensing. If all the “mass” is concentrated in 1 value, the matrix is highly coherent, and we cannot get any advantage from compressed sensing.

Second, the sensing matrix of measurements cannot be arbitrary: it is determined by the physical nature of the measurement process. A typical implementation of the measurement matrix is a Fourier transform matrix.

Third, our initial basis might not be the one where the signal is sparse.

Hence, we might want to decompose our matrix $A$ into a product of two matrices: a measurement matrix and a sparsity basis matrix: $A = \Phi \Psi$.

For instance, consider one vertical line of an image. The image itself is not sparse. However, if we applied a wavelet transform to it, in the wavelet basis it does become sparse. Hence, the $\Psi$ matrix is the matrix of the wavelet transform. Now, $\Phi$ is the measurement matrix: e.g. if you have a digital camera with a CMOS array or some kind of a photofilm, the measurement matrix $\Phi$ might be something like a Fourier transform.

Now we can give a new interpretation to our signal incoherence requirement:

$$\mu(A) = \max \limits_{k, j} | \langle \phi_k, \psi_j \rangle |$$

This means that every measurement vector (e.g. a Fourier transform vector) should be “spread out” in the sparsity-inducing basis, otherwise the compressed sensing magic won't work. For instance, if the $l_2$ norm of each row of our sensing matrix is fixed to be $\sqrt{N}$, and all of a row's values are 0 except a single value of $\sqrt{N}$, the sensing magic won't work, as all the “mass” is concentrated in one point. However, if the mass is spread evenly (e.g. all the coordinates of each row have an absolute value of 1), that works perfectly, and the required number of measurements $n$ is minimal.
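
As an illustration (a sketch of mine, not from the original post), this is how one could compute the coherence of a spike measurement basis against a DCT sparsity basis, under the normalization above, so that $\mu$ ranges from 1 to $\sqrt{N}$:

import numpy as np
from scipy.fftpack import dct

N = 256
Psi = dct(np.eye(N), norm='ortho', axis=0)   # orthonormal DCT basis (columns are basis vectors)
Phi_spike = np.eye(N)                        # spike (Dirac delta) measurement basis
Phi_dct = Psi.copy()                         # measuring directly in the sparsity basis

def coherence(Phi, Psi):
    # mu = sqrt(N) * max |<phi_k, psi_j>| for orthonormal bases; ranges from 1 to sqrt(N)
    return np.sqrt(N) * np.max(np.abs(Phi.T @ Psi))

print("spike vs DCT:", coherence(Phi_spike, Psi))   # ~1.41: incoherent, good for sensing
print("DCT   vs DCT:", coherence(Phi_dct, Psi))     # = sqrt(N): perfectly coherent, bad
print("sqrt(N)     :", np.sqrt(N))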

Theorem 1: exact formulation and proof outline

The exact formulation of Theorem 1 is probabilistic: set a small enough probability $\delta$, e.g. $\delta=0.001$.

Then with probability no less than $1 - \delta$ the recovery of an $S$-sparse vector ${\bf x}$ with a sensing matrix $A$ is exact, if the number of measurements $n \ge C \cdot S \cdot \mu^2(A) \cdot (\log N - \log \delta)$.

There are multiple proofs by Candes and Tao (an initial proof, specialized for partial Fourier transform as sensing matrix, its extension for arbitrary unitary sensing matrices and, finally, a simplified proof in Lecture 4 of the series of Cambridge lectures by Emmanuel Candes).

Step 1: dual certificate guarantees uniqueness of the solution

To prove this step I am following lecture 2 by E.Candes and a separate paper with a good proof.

The dual norm to the vector norm $||x||$ is $|| y ||_d = \max \limits_{|| x || \le 1} \langle y, x \rangle$.

Lemma 1.1. Dual problem to the constrained norm minimization problem

Given a minimization problem $\min || x ||$, constrained on $Ax = b$, the corresponding dual problem to it is:

$\max \limits_{\alpha} \langle \alpha, b \rangle$, constrained on $|| \alpha^T A ||_d \le 1$

To see this consider the Lagrangian:

$$\mathcal{L}(x, \alpha) = || x || + \alpha^T (b - Ax)$$

$$\mathcal{L}(x, \alpha) = ||x|| \left(1 - \alpha^T A \frac{x}{||x||}\right) + \alpha^T b$$

Now, minimize this Lagrangian over $x$. The expression $\alpha^T A \frac{x}{||x||}$ is exactly the kind of quantity that appears in the dual norm: $u = \frac{x}{||x||}$ is a vector with norm 1, and $\max \limits_{u} \alpha^T A u = || \alpha^T A ||_d$.

Hence, the value of $\min \limits_x \mathcal{L}(x, \alpha)$ is determined by how $|| \alpha^T A ||_d$ compares to 1.

If $|| \alpha^T A ||_d \le 1$, the Lagrangian attains its infimum $\mathcal{L}^* = \alpha^T b$ as $||x|| \to 0$.

If $|| \alpha^T A ||_d > 1$, the Lagrangian can be made arbitrarily small by letting $||x|| \to \infty$.

Lemma 1.2. Lasso dual problem

Observe that the norm, dual to L1 is L-infinity:

$$|| \alpha ||_\infty = \max \limits_{||x||_1 = 1} \langle \alpha, x \rangle$$

Thus, by Lemma 1.1, Lasso dual problem is:

$\max \limits_{\alpha} \alpha^T b$, given $|| \alpha^T A ||_\infty \le 1$.
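
To see Lemmas 1.1-1.3 in action, here is a small numeric check (my own sketch): we solve the basis pursuit primal and its dual as linear programs with scipy's linprog, confirm that their optimal values coincide, and check that $A^T \alpha$ equals $\pm 1$ on the support of the primal solution.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, N = 20, 60
A = rng.normal(size=(n, N))
x0 = np.zeros(N)
x0[rng.choice(N, 3, replace=False)] = rng.normal(size=3)
b = A @ x0

# primal: min ||x||_1 s.t. Ax = b, as an LP over x = x_plus - x_minus with x_plus, x_minus >= 0
primal = linprog(c=np.ones(2 * N), A_eq=np.hstack([A, -A]), b_eq=b, bounds=(0, None))

# dual: max <alpha, b> s.t. ||A^T alpha||_inf <= 1 (linprog minimizes, so we negate the objective)
dual = linprog(c=-b, A_ub=np.vstack([A.T, -A.T]), b_ub=np.ones(2 * N), bounds=(None, None))

x_star = primal.x[:N] - primal.x[N:]
support = np.flatnonzero(np.abs(x_star) > 1e-8)

print("primal optimum ||x*||_1 :", primal.fun)
print("dual optimum <alpha*, b>:", -dual.fun)                 # equal to the primal optimum (strong duality)
print("A^T alpha* on the support:", (A.T @ dual.x)[support])  # +-1, matching sign(x*)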

Here is a graphical interpretation of the dual problem by Ryan Tibshirani

lasso dual Tibshirani

Graphical interpretation of the lasso dual problem. Look at the right picture first. In a certain basis we know that the constraints allow the feasible solutions $v = A^T \alpha$ (or $X^T \alpha$) to lie within a cube. Now transform that space by applying the $(A^T)^{-1}$ (or $(X^T)^{-1}$) matrix: we get some convex polytope, and we strive to find such a projection direction $\alpha$ that the projection of $b$ (or $y$) along it onto that polytope is as large as possible.

Lemma 1.3. For the optimal vector x, the dual vector α^T A equals the sign of x at its non-zero coordinates x[i]

$$\alpha^T A [i] = \text{sign}(x [i]) \quad \text{if } x[i] \neq 0$$

Applying strong duality, we get:

$$||x||_1 = \alpha^T b = \alpha^T A x = \sum_i \alpha^T A[i] \, x[i]$$

Hence, $\alpha^T A [i] = \text{sign}(x [i])$ (otherwise some components would cancel each other out and we would achieve a value of the L1 norm less than $||x||_1$, given that $|| \alpha^T A ||_\infty \le 1$).

Lemma 1.4. A dual certificate guarantees uniqueness of the solution

I am following E.Candes lecture 2 here.

Suppose that we've managed to find a solution $\bar{x}$ to our primal problem.

Let us prove that any other feasible solution $x$ is worse than $\bar{x}$. Denote $S$ the set of indices $i$ where $\bar{x}[i]$ is non-zero. Denote $S_c$ the complementary set of indices $i$, where $\bar{x}[i] = 0$.

Denote $h = x - \bar{x}$ the vector of difference between $x$ and $\bar{x}$. Obviously, $h$ lies in the null space of the operator $A$: since $A\bar{x} = y$ and $Ax = y$, we get $Ah = 0$.

Denote $v = \begin{cases} \text{sign}(\bar{x}[i]), & \text{if } i \in S \\ v_i \text{ with } |v_i| \le 1, & \text{if } i \in S_c \end{cases}$ the dual certificate vector.

dual certificate

Slide from the Cambridge lectures of E. Candes. The image displays a 2D case of our problem with the L1 ball, the null space of the operator $A$ and the dual certificate vector $v$.

Now, suppose there is another vector $x$, which has a smaller L1 norm than our solution $\bar{x}$. Let us show that this is impossible.

Consider the L1 norm of $x$:

$$||x||_1 = ||\bar{x} + h||_1 = || \bar{x} + h_S ||_1 + || h_{S_c} ||_1 \ge || \bar{x} ||_1 + \sum \limits_{i \in S} \text{sign}(\bar{x})[i]\, h_S[i] + || h_{S_c} ||_1$$

Now, note that $\sum \limits_{i \in S} \text{sign}(\bar{x})[i]\, h_S[i]$ is part of our dual certificate here:

$$|| \bar{x} ||_1 + \sum \limits_{i \in S} \text{sign}(\bar{x})[i]\, h_S[i] + || h_{S_c} ||_1 = || \bar{x} ||_1 + \sum \limits_{i \in S} v_i h_i + || h_{S_c} ||_1 \ge$$

$$\ge || \bar{x} ||_1 + \langle v, h \rangle + \sum \limits_{i \in S_c}(1 - |v_i|)\, |h_i|$$

Now recall that our dual certificate is orthogonal to the null space of $A$, i.e. $\langle v, h \rangle = 0$. Hence:

$$||x||_1 \ge || \bar{x} ||_1 + \sum \limits_{i \in S_c} (1 - |v_i|)\, |h_i|$$

As the absolute value of each coordinate of the dual certificate on the set $S_c$ is smaller than 1, we have just guaranteed that $||x||_1 \ge ||\bar{x}||_1$, unless $x = \bar{x}$.

Step 2: a way to construct a dual certificate

I am following this tutorial on compressed sensing.

Now we show that with overwhelming probability it is possible to construct the dual certificate.

$v = A^T A_S (A_S^T A_S)^{-1} \text{sign}(\bar{x})_S$ and also $v^T = \alpha^T A$, hence $\alpha = A_S (A_S^T A_S)^{-1} \text{sign}(\bar{x})_S$.

Let us write the SVD of $A_S = U \Sigma V^T$ (denoting the diagonal matrix of singular values $\Sigma$, to avoid confusion with the sparsity $S$). Then $A_S^T A_S = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T$, and $(A_S^T A_S)^{-1} = V^{-T} \Sigma^{-2} V^{-1}$.

Then $A_S (A_S^T A_S)^{-1} = U \Sigma V^T V^{-T} \Sigma^{-2} V^{-1} = U \Sigma^{-1} V^{-1}$, and, recalling that $V$ is orthogonal, $V^{-1} = V^T$, so $A_S (A_S^T A_S)^{-1} = U \Sigma^{-1} V^T$.

Now suppose that the diagonal matrix of non-negative singular values $\Sigma$ has largest singular value $\sigma_1$ and smallest singular value $\sigma_S$. Hence, the largest singular value of $\Sigma^{-1}$ is $\frac{1}{\sigma_S}$.

This allows us to put an upper bound on the L2 norm of our vector $\alpha$:

$$||\alpha||_2 = || U \Sigma^{-1} V^T \text{sign}(\bar{x})_S ||_2 \le \frac{||\text{sign}(\bar{x})_S||_2}{\sigma_S} = \frac{\sqrt{S}}{\sigma_S}$$

Now, recall that $v^T = \alpha^T A$, or $v = A^T \alpha$.

Consider each coordinate of the vector, $v_i = A_i^T \alpha$. If the entries of $A$ are i.i.d. standard Gaussian, then for $i \in S_c$ (where $A_i$ is independent of $\alpha$) the random variables $\frac{A_i^T \alpha}{||\alpha||_2}$ are standard Gaussian, hence the following simple inequality holds for them:

$$p(|\xi| \ge t) \le 2 e^{-\frac{t^2}{2}}$$

Applying this inequality, we get:

$$p(| A_i^T \alpha | \ge 1) = p\left(\frac{| A_i^T \alpha|}{|| \alpha ||_2} \ge \frac{1}{||\alpha||_2}\right) \le 2e^{-\frac{1}{2 ||\alpha||^2_2}} \le 2e^{-\frac{\sigma_S^2}{2S}}$$
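
As a quick sanity check of the Gaussian tail bound used here (my own addition), one can compare it with Monte Carlo frequencies:

import numpy as np

rng = np.random.default_rng(4)
xi = rng.standard_normal(1_000_000)   # a large sample of standard Gaussians
for t in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(xi) >= t)
    bound = 2 * np.exp(-t**2 / 2)
    print(f"t={t}: P(|xi| >= t) ~ {empirical:.4f} <= bound {bound:.4f}")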

Now, due to concentration inequalities similar to the ones behind the Johnson-Lindenstrauss lemma, we can obtain bounds on the largest and smallest singular values of our matrix $A_S$: $\sigma_S \ge 0.5 \sqrt{n}$, where $n$ is the number of measurements in compressed sensing.

Hence $p(| A_i^T \alpha | \ge 1) \le 2 e^{-\frac{n}{8S}}$.

Now we want to bound the probability that $|v_i| \ge 1$ for some coordinate $i \in S^c$ of our vector $v$ (if that happens, $v$ fails to be a valid dual certificate).

By the union bound, $p(\exists i \in S^c: |v_i| \ge 1) \le N \cdot p(| A_i^T \alpha | \ge 1) \le 2 N e^{-\frac{n}{8S}}$,

and requiring this failure probability to be at most $\delta$ gives $n \ge C \cdot S \cdot (\log N - \log \delta)$.

We now see how the required number of measurements $n$ depends on the total dimensionality $N$ and the sparsity $S$, and by choosing $n$ large enough we can make the failure probability arbitrarily small, so that the dual certificate is almost guaranteed to exist in practice.
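
Here is a quick numeric illustration of this construction (a sketch of mine, assuming a random Gaussian sensing matrix): the candidate certificate equals the signs of $\bar{x}$ on the support and, for these dimensions, stays strictly below 1 in absolute value off the support.

import numpy as np

rng = np.random.default_rng(2)
n, N, S = 200, 2000, 5

A = rng.normal(size=(n, N)) / np.sqrt(n)
support = rng.choice(N, S, replace=False)
x_bar = np.zeros(N)
x_bar[support] = rng.normal(size=S)

A_S = A[:, support]
# candidate certificate: alpha = A_S (A_S^T A_S)^{-1} sign(x_bar)_S,  v = A^T alpha
alpha = A_S @ np.linalg.solve(A_S.T @ A_S, np.sign(x_bar[support]))
v = A.T @ alpha

off_support = np.setdiff1d(np.arange(N), support)
print("v on the support:", v[support])                             # exactly the signs of x_bar
print("max |v_i| off the support:", np.abs(v[off_support]).max())  # typically well below 1 here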

Step 3 (optional): operator Bernstein inequality

I am following the appendix of this paper on matrix completion

Alternatively, one can use operator versions of exponentiated Markov inequalities.

Recall that $A^T$ is a random matrix and apply the operator form of the Bernstein inequality (I will outline its proof in the next step) to the operator norm $|| A^T ||$ or to a random vector $A^T \alpha$.

The operator form of the Bernstein inequality is a stronger form of the Bernstein inequality for random variables, which is, in turn, a strengthened version of the Markov inequality (a.k.a. Chebyshev's first inequality in the Russian tradition).

Recall the Markov (a.k.a. Chebyshev's first) inequality for a non-negative random variable $\xi$:

$$p(\xi \ge \epsilon) \le \frac{\mathbb{E} \xi}{\epsilon}$$

If we exponentiate both sides of the inequality inside the probability term, a number of useful inequalities follow, such as the Chernoff bound, the Hoeffding inequality and the Bernstein inequality. Specifically, the Bernstein inequality for a sum of independent random variables states that:

$$p\left(\left|\frac{1}{n} \sum \limits_{i=1}^n \xi_i\right| \ge \epsilon\right) \le e^{- \frac{n \epsilon^2}{2 (1 + \frac{\epsilon}{3})}}$$

Now, instead of random variables $\xi_i$ consider random matrices $A_i$. Replace the absolute value with the spectral norm:

$$p\left(\left|\left|\frac{1}{n} \sum \limits_{i=1}^n A_i\right|\right| \ge \epsilon\right) \le e^{- \frac{n \epsilon^2}{2 (1 + \frac{\epsilon}{3})}}$$

The proof can be seen here. It makes use of an operator version of Markov (Chebyshev’s first) inequality, operator Chernoff bound, Golden-Thompson inequality and, finally, operator Bernstein inequality.

Speaking informally, the matrices $A_{\Omega T}$ are almost orthogonal with overwhelming probability. Take note of this fact, it will be central for Theorems 2 and 3 in the deterministic approach.

Robustness of compressed sensing

For compressed sensing to be applicable to real-life problems, the technique should be resistant to corruptions of the data. They are of two kinds:

  • First, in reality the vectors $x$ in the sensing basis might not be exactly $S$-sparse, there might be some noise.
  • Second, the obtained measurements $y_i$ in reality are likely to be corrupted by some noise as well.

Amazingly, compressed sensing is resistant to both of these problems, which is provable by theorems 2 and 3 respectively.

Restricted isometry property (RIP) and its flavours

While Theorem 1 has a probabilistic nature, the even more powerful Theorems 2 and 3 have a deterministic nature. To pose them we'll need another property, called restricted isometry.

Recall that a crucial part of the proof of Theorem 1 was the fact that the matrix $A_{\Omega T}^* A_{\Omega T}$ is almost orthogonal. A similar requirement is made even stronger with the Restricted Isometry Property (RIP).

For some constant $\delta_S < 1$, reflecting the sparsity of the vector $x_S$, and for any sub-matrix $A_{\Omega T}$ of the matrix $A$:

$$(1 - \delta_S) || x ||_2 \le || A_{\Omega T} x ||_2 \le (1 + \delta_S) || x ||_2$$

In plain English this means that if $A$ is a rectangular “wide” $n \times N$ matrix, and $x_S$ is a sparse vector with all of its $N$ coordinates zero, except for a subset of coordinates $T$ of cardinality $|T| = S$, then any sub-matrix $A_{\Omega T}$ with all columns removed but those in $T$ is almost orthogonal and preserves the lengths of the vectors $x$ it acts upon, up to the constant $\delta_S$. Almost-orthogonality of the matrix $A_{\Omega T}$ follows from length preservation because $|| x ||_2^2 = x^T x \approx x^T A_{\Omega T}^T A_{\Omega T} x$.

In my opinion, an even more useful form of an equivalent statement is distance preservation:

$$(1 - \delta_S) || x_1 - x_2 ||_2 \le || A_{\Omega T} (x_1 - x_2) ||_2 \le (1 + \delta_S) || x_1 - x_2 ||_2$$

If this requirement seems contrived to you, consider several practical examples of matrices, for which RIP holds.

Random Gaussian matrix: Johnson-Lindenstrauss lemma and random projections

Consider an $n \times N$ matrix, each element of which is a realization of a random normal variable. Such matrices are used for random projections.

For such matrices a super-useful Johnson-Lindenstrauss lemma is known, which states that with overwhelmingly high probability such matrices preserve distances between points and lengths of vectors.

Obviously any sub-matrix of a random Gaussian matrix is itself a random Gaussian matrix, so Johnson-Lindenstrauss lemma allows for a probabilistic version of RIP to hold, if we use random Gaussian projections as our sensing matrix.
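
A small empirical check of this (my own sketch): the singular values of random $n \times S$ column-submatrices of a scaled Gaussian matrix stay close to 1, so sparse vectors are mapped almost isometrically.

import numpy as np

rng = np.random.default_rng(3)
n, N, S = 200, 4000, 10

A = rng.normal(size=(n, N)) / np.sqrt(n)   # scaling makes the expected column norm equal to 1

deltas = []
for _ in range(200):
    T = rng.choice(N, S, replace=False)              # a random candidate support
    s = np.linalg.svd(A[:, T], compute_uv=False)     # singular values of the submatrix
    deltas.append(max(s.max() - 1, 1 - s.min()))     # worst deviation from an exact isometry

print("largest deviation over 200 random supports:", max(deltas))  # typically ~0.2-0.3 for these sizes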

Random Bernoulli/binary matrix: Kashin-Garnaev-Gluskin lemma

A similar case is random binary matrices: each element of such a matrix is a Bernoulli random variable, taking values in $\{1, -1\}$.

TODO: Kolmogorov width and Gelfand width

TODO: Kashin-Garnaev-Gluskin lemma

Partial Fourier transform matrix: Rudelson-Vershynin lemma

A similar result was proven by Mark Rudelson and Roman Vershynin for partial Fourier transform matrices.

TODO: mention $\log^5$ convergence

Theorem 2 (robust recovery of a sparse subset of the largest vector coordinates, even if the vector is not sparse)

Suppose that in reality our vector ${\bf x}$ is not $S$-sparse, but is corrupted with perturbations. Let us denote ${\bf x}_S$ its $S$-sparse sub-vector, keeping only the $S$ largest (in absolute value) coordinates and zeroing out the rest.

E.g. ${\bf x} = \begin{pmatrix} 0.1 \\ 1 \\ 2 \\ 0.2 \end{pmatrix}$ and ${\bf x}_S = \begin{pmatrix} 0 \\ 1 \\ 2 \\ 0 \end{pmatrix}$, the latter being $S=2$-sparse.

Then:

$$||{\bf \bar{x}} - {\bf x}||_1 \le C_S \cdot || {\bf x} - {\bf x}_S ||_1 \quad \text{and} \quad || {\bf \bar{x}} - {\bf x}||_2 \le C_S \cdot \frac{ || {\bf x} - {\bf x}_S ||_1}{\sqrt{S}}$$

In plain English, the error between the vector ${\bf \bar{x}}$ recovered by compressed sensing and the true vector ${\bf x}$ is no greater than a constant times the L1 norm of the remainder of ${\bf x}$ after we subtract from it the vector ${\bf x}_S$, which contains its $S$ largest-in-absolute-value coordinates.

Here is an example of reconstruction of an approximately (but not perfectly) sparse signal by compressed sensing. Note that some low-amplitude noise is discarded by compressed sensing, resulting in a slightly sparser signal than it was originally:

approximately sparse signal

Recovery of an approximately sparse signal by compressed sensing. Left image: the original signal, which is approximately sparse, but with some tiny noise. Right image: the signal recovered by compressed sensing is slightly sparser and less noisy.

TODO: tightness of compressed sensing by Kashin-Garnaev-Gluskin lemma

Theorem 3 (robust recovery of sparse signal upon noisy measurements)

Now that “things got real”, let us make them even more real: our measurement process is imperfect, too. Assume that we have some noise in the measurements, but it is bounded in the l2 norm by a parameter $\epsilon$. Hence, we are solving the following problem now:

$$\begin{cases} \min || {\bf \bar{x}} ||_1 \\ || A {\bf \bar{x}} - {\bf y} ||_2 \le \epsilon \end{cases}$$

Theorem 3 states that this problem can be solved efficiently, too:

$$|| {\bf \bar{x}} - {\bf x} ||_2 \le \underbrace{C_0 \cdot \epsilon}_\text{measurement noise} + \underbrace{C_1 \cdot \frac{|| {\bf x} - {\bf x}_S ||_1}{\sqrt{S}}}_\text{non-exact sparsity of the input vector}$$

Putting this simply, the l2 error in recovery of the sparse signal is limited by a weighted sum of two errors: noise of measurement and non-exact sparseness of the input vector. If both of them are not too large, the recovery can be quite accurate.
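
To see Theorems 2 and 3 at work together, here is a small simulated experiment (my own sketch, with arbitrarily chosen sizes and a hand-picked Lasso regularization weight): the unknown vector has a small dense tail on top of $S$ large coordinates, the measurements carry l2-bounded noise, and the recovery error stays on the scale set by the two error terms above.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, N, S = 150, 2000, 5

# approximately sparse x: S large coefficients plus a small dense tail
x = 0.01 * rng.normal(size=N)
support = rng.choice(N, S, replace=False)
x[support] += rng.choice([-1.0, 1.0], S) * rng.uniform(1, 2, S)
x_S = np.zeros(N)
x_S[support] = x[support]                           # best S-sparse approximation of x

A = rng.normal(size=(n, N)) / np.sqrt(n)
eps = 0.05
y = A @ x + eps * rng.normal(size=n) / np.sqrt(n)   # noise vector with l2 norm ~ eps

# alpha is a tuning knob here; a small value approximates the constrained L1 problem
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100000)
lasso.fit(A, y)

print("recovery error ||x_hat - x||_2 =", np.linalg.norm(lasso.coef_ - x))
print("bound terms: eps =", eps,
      ", ||x - x_S||_1 / sqrt(S) =", np.linalg.norm(x - x_S, 1) / np.sqrt(S))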

Theorem 3 and Theorem 2 proof outlines

I am following two papers: a short one by E.Candes and a long one by the whole team.

Theorem 2 proof

We need to show that in the noiseless case (i.e. for any feasible $x$, $Ax = y$):

$$||x-\bar{x}||_1 \le C\, ||\bar{x}_{S^c}||_1$$

Or, denoting $h = x-\bar{x}$:

$$||h||_1 \le C\, ||\bar{x}_{S^c}||_1$$

Lemma 2.1. L1 cube/cone constraint

First step to construct this inequality is the cone constraint:

Since $x = \bar{x} + h$ is the minimizer of the L1 norm and $\bar{x}$ is also feasible, $||\bar{x}||_1 \ge ||\bar{x} + h||_1$, so:

$$||\bar{x}||_1 = ||\bar{x}_S||_1 + ||\bar{x}_{S_c}||_1 \ge ||\bar{x} + h||_1 = ||\bar{x}_{S} + h_{S}||_1 + ||\bar{x}_{S_c} + h_{S_c}||_1 \ge$$

$$\ge ||\bar{x}_S||_1 - ||h_{S}||_1 - ||\bar{x}_{S_c}||_1 + ||h_{S_c}||_1$$

Hence:

$$||\bar{x}_S||_1 + ||\bar{x}_{S_c}||_1 \ge ||\bar{x}_S||_1 - ||h_{S}||_1 - ||\bar{x}_{S_c}||_1 + ||h_{S_c}||_1$$

$$||h_{S}||_1 \ge ||h_{S_c}||_1 - 2 ||\bar{x}_{S_c}||_1$$

or $||h_{S_c}||_1 - ||h_{S}||_1 \le 2 ||\bar{x}_{S_c}||_1$

This is very close to what we need. If we could find a way to express $||h_{S}||_1$ and $||h_{S_c}||_1$ through $||h||_1$, we'd be done.

Lemma 2.2. RIP application

Here the RIP comes in handy:

Split the complement $S_c$ into blocks $S_1, S_2, \ldots$ of size $|S|$, ordered by decreasing magnitude of the coordinates of $h$. Then:

$$0 = ||Ah||_2 = ||A (h_{S} + \sum \limits_{j \ge 1} h_{S_j})||_2 \ge ||A h_{S}||_2 - ||\sum \limits_{j \ge 1} A h_{S_j}||_2 \ge ||A h_{S}||_2 - \sum \limits_{j \ge 1} ||A h_{S_j}||_2 \ge$$

$$\ge (1-\delta_{|S|})||h_{S}||_2 - (1+\delta_{|S_j|}) \sum \limits_{j \ge 1} ||h_{S_j}||_2$$

Equivalently, $(1-\delta_S) ||h_{S}||_2 \le (1+\delta_{|S_j|}) \sum \limits_{j \ge 1} ||h_{S_j}||_2$.

Now what’s left is to convert these L2 norms to L1 norm. Luckily, we have:

Lemma 2.3. L2 norm bounds L1 from both top and bottom

$$||x||_2 \le ||x||_1 \le \sqrt{\dim x}\, ||x||_2$$

The left part of the inequality holds due to the Pythagorean theorem, the right part due to the Cauchy-Schwarz inequality.

$$\sum \limits_j ||h_{S_j}||_2 \le \sqrt{|S_j|} \sum \limits_j || h_{S_j} ||_{\infty} \le \frac{1}{\sqrt{|S_j|}} \sum \limits_j ||h_{S_{j-1}}||_1$$

(the last step holds because, by construction of the blocks, every coordinate of $h_{S_j}$ is at most the average magnitude of the coordinates of the previous block, i.e. $|| h_{S_j} ||_{\infty} \le ||h_{S_{j-1}}||_1 / |S_{j-1}|$)

Applying this to where we left, we get:

$$\frac{(1-\delta_S)||h_S||_1 }{\sqrt{S}} \le (1-\delta_S) ||h_{S}||_2 \le (1+\delta_{|S_j|}) \sum \limits_{j \ge 1} ||h_{S_j}||_2 \le (1+\delta_{|S_j|}) \sum \limits_{j \ge 1} ||h_{S_j}||_1 = (1+\delta_{|S_j|}) ||h_{S_c}||_1$$

Hence, $\frac{1-\delta_S}{\sqrt{S} (1+\delta_{S_c})} ||h_S||_1 \le ||h_{S_c}||_1$.

Identically, $||h||_1 = ||h_S||_1 + ||h_{S_c}||_1 \ge ||h_{S}||_1 \left(1 + \frac{1-\delta_S}{\sqrt{S} (1+\delta_{S_c})}\right)$

Or, $||h_{S}||_1 \le \frac{1}{1 + \frac{1-\delta_S}{\sqrt{S} (1+\delta_{S_c})}} ||h||_1$

Substituting this back into $||h_{S_c}||_1 - ||h_{S}||_1 = ||h||_1 - 2 ||h_{S}||_1 \le 2 ||\bar{x}_{S_c}||_1$, we get:

$$2 ||\bar{x}_{S_c}||_1 \ge ||h||_1 - 2 ||h_S||_1 \ge \left( 1 - \frac{2}{1 + \frac{1-\delta_S}{\sqrt{S} (1+\delta_{S_c})}}\right) ||h||_1$$

$$||\bar{x}_{S_c}||_1 \ge \left( \frac{1}{2} - \frac{1}{1 + \frac{1-\delta_S}{\sqrt{S} (1+\delta_{S_c})}}\right) ||h||_1 \quad \text{or} \quad ||h||_1 \le C_S ||\bar{x}_{S_c}||_1$$

The theorem also contains a clause $||h||_2 \le C_S \frac{||\bar{x}_{S_c}||_1}{\sqrt{S}}$, which follows from the fact that $||h||_2 \le ||h||_1$. I am not sure about the exact origins of the $\frac{1}{\sqrt{S}}$ multiplier, though.

This concludes the proof.

Theorem 3 proof

We need to show that:

$$|| \bar{x} - x ||_2 \le C || \bar{x}_{S_c}||_1 + C_1 \epsilon$$

Theorem 3 follows the general logic of Theorem 2, but now our measurements are noisy, so the application of RIP requires a tube constraint.

Lemma 3.1. Tube constraint

Suppose that the true point $\bar{x}$ resulted in a measurement $y$, but due to noise a different solution $x$ was found. This solution is still going to lie within a tube of radius $\epsilon$ surrounding the null space of $A$:

$$|| A (\bar{x} - x) ||_2 = ||(A\bar{x} - y) - (Ax - y)||_2 \le || A \bar{x} - y ||_2 + || A x - y ||_2 \le 2 \epsilon$$

Both $\bar{x}$ and $x$ are feasible solutions, resulting in the measurement $y$ with some l2 noise.

theorem 3 geometry in 2D

If we substitute the result of lemma 3.1 into the lemma 2.2, we get:

$$\frac{1-\delta_S}{\sqrt{S}} ||h_S||_1 \le (1+\delta_{S_c}) ||h_{S_c}||_1 + 2 \epsilon \quad \text{or} \quad ||h_S||_1 \le \frac{\sqrt{S} (1+\delta_{S_c})}{1-\delta_S} ||h_{S_c}||_1 + \frac{2 \epsilon \sqrt{S}}{1-\delta_S}$$

Now apply lemma 2.1:

$||h_{S_c}||_1 - ||h_{S}||_1 \le 2 ||\bar{x}_{S_c}||_1$, hence:

$$||h_{S_c}||_1 - 2||\bar{x}_{S_c}||_1 \le ||h_{S}||_1 \le \frac{\sqrt{S} (1+\delta_{S_c})}{1-\delta_S} ||h_{S_c}||_1 + \frac{2 \epsilon \sqrt{S}}{1-\delta_S}$$

$$||h_{S_c}||_1 \left(1 - \frac{\sqrt{S} (1+\delta_{S_c})}{1-\delta_S}\right) \le 2||\bar{x}_{S_c}||_1 + \frac{2 \epsilon \sqrt{S}}{1-\delta_S}$$

or

$$||h_{S_c}||_1 \left(\frac{\sqrt{S} (1+\delta_{S_c})}{1-\delta_S} - 1\right) \ge 2||\bar{x}_{S_c}||_1 + \frac{2 \epsilon \sqrt{S}}{1-\delta_S}$$

Now recall that $2||h_{S_c}||_1 - ||h||_1 \le 2 ||\bar{x}_{S_c} ||_1$, or $||h_{S_c}||_1 \le ||\bar{x}_{S_c} ||_1 - \frac{1}{2} ||h||_1$, and get:

$$\left(\frac{\sqrt{S} (1+\delta_{S_c})}{1-\delta_S} - 1\right) \left(||\bar{x}_{S_c} ||_1 - \frac{1}{2} ||h||_1\right) \ge ||h_{S_c}||_1 \left(\frac{\sqrt{S} (1+\delta_{S_c})}{1-\delta_S} - 1\right) \ge 2||\bar{x}_{S_c}||_1 + \frac{2 \epsilon \sqrt{S}}{1-\delta_S}$$

$$\left(\frac{\sqrt{S} (1+\delta_{S_c})}{1-\delta_S} - 3\right) ||\bar{x}_{S_c} ||_1 - \frac{2 \epsilon \sqrt{S}}{1-\delta_S} \ge ||h||_1 \cdot \frac{1}{2} \cdot \left(\frac{\sqrt{S} (1+\delta_{S_c})}{1-\delta_S} - 1\right)$$

$$||h||_1 \cdot \frac{1}{2} \cdot \left(\frac{\sqrt{S} (1+\delta_{S_c})}{1-\delta_S} - 1\right) \le \left(\frac{\sqrt{S} (1+\delta_{S_c})}{1-\delta_S} - 3\right) ||\bar{x}_{S_c} ||_1 - \frac{2 \epsilon \sqrt{S}}{1-\delta_S}$$

$$||h||_1 \le C_1 || \bar{x}_{S_c} ||_1 + C_2 \epsilon$$

As $||h||_2 \le ||h||_1$, this leads to:

$$||h||_2 \le C_1 || \bar{x}_{S_c} ||_1 + C_2 \epsilon$$

Practical example

Let us generate a wav-file with a simple chord:

piano roll

A-E-A chord

Let us take a perfect piano and consider a Discrete Cosine Transform of A-E-A chord (I don’t want to deal with complex numbers in this case, otherwise, I’d use Fourier).

I’ll open a Jupyter-notebook and programmatically generate DCT spectrum of this signal, then synthesize an artificial wav-file with sound pressures out of it and play it:

import numpy as np
import matplotlib.pyplot as plt
from scipy.fftpack import dct, idct  # dct/idct are called directly below
from scipy.io import wavfile
from IPython.display import Audio

dim = 48000

a_e_a_chord_dct = np.zeros(dim)

a_e_a_chord_dct[440] = 1
a_e_a_chord_dct[660] = 1
a_e_a_chord_dct[880] = 1


generated_wav = idct(a_e_a_chord_dct)

Audio(generated_wav, rate=dim)

Sounds like a chord, right?

Now, let us show that with compressed sensing we don’t have to measure our sound at 48 000 points a second, we will be able to recover the initial signal pretty accurately from just 100 measurement points.

import random
import math

from sklearn import linear_model


# generate a DCT matrix to select just a few random rows from it and use them as a sensing matrix
I = np.identity(dim)
dct_matrix = dct(I)

n_measurements = 100  # measure the sound pressure at this amount of points 


def get_random_indices(n_indices=n_measurements, dim=dim):
    output = []
    counter = 0
    while counter < n_indices:
        prospective_random_index = math.floor(random.random() * dim)
        if prospective_random_index not in output:
            output.append(prospective_random_index)
            counter += 1
        else:
            pass  # redraw, if random index was already used
    
    return output


# select a few rows from our DCT matrix to use this partial DCT matrix as a sensing matrix
random_indices = get_random_indices()
partial_dct = np.take(dct_matrix, random_indices, axis=1)

# measure sound pressures at a small number of points by applying sensing matrix to the signal spectrum
y = np.dot(partial_dct.T, a_e_a_chord_dct)

# make the measurement noisy
noise = 0.01 * np.random.normal(0, 1, n_measurements)
signal = y + noise

# recover the initial spectrum from the measurements with L1-regularized regression (implemented as sklearn's Lasso in this case)
lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(partial_dct.T, signal)
for index, i in enumerate(lasso.coef_):
    if i > 0:
        print(f"index = {index}, coefficient = {i}")
index = 440, coefficient = 0.9322433129902213
index = 660, coefficient = 0.9479368288017439
index = 880, coefficient = 0.9429783262003731

Here we go: using just 100 noisy measurements we've managed to recover the support of our signal's spectrum exactly, with only about a 5-7% error in the amplitudes.

Let us listen to the result:

recovered_wav = idct(lasso.coef_)
Audio(recovered_wav, rate=dim)
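
If you also want a quantitative check on top of listening (my addition), compare the two waveforms directly:

# relative L2 error between the reconstructed and the original waveform (a few percent here)
rel_err = np.linalg.norm(recovered_wav - generated_wav) / np.linalg.norm(generated_wav)
print(f"relative L2 error of the reconstructed waveform: {rel_err:.3f}")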

Can your ear tell the difference? I can not.

Magic!
