The multivariate normal distribution arises in many areas of mathematical statistics and machine learning. For instance, Cochran's theorem in statistics, and PCA and Gaussian processes in ML, rely heavily on its properties. Thus, I'll discuss it here in detail.
Sir Francis Galton (1822 - 1911)
Sir Francis Galton might be considered the grandfather of modern statistics. He analyzed the data on height of parents and their children:
Heights of 205 parents and 930 adult children, from “Regression towards mediocrity in hereditary stature” by F.Galton, 1886
Heights of parents were normally distributed, as were the heights of their children. However, the two distributions were
clearly not independent, as taller parents generally have taller children. This plate is probably one of the first depictions of an isocontour of a 2-variate normal distribution. As a side note,
Galton also came up with the ratio of male to female heights here, which is 1.08 (according to modern data, it is closer to 1.07).
Galton actually rediscovered the concept of correlation two years after this paper, in 1888.
Multivariate normal distribution
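A random vector $X = (X_1, \dots, X_n)^T$ is called multivariate normal if every linear combination $a^T X$ of its coordinates has a one-dimensional normal distribution. In particular, each coordinate $X_i$ is normal, but normality of the individual coordinates alone is not sufficient.
We write $X \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ is the vector of means, and the elements of the matrix $\Sigma$ are covariances between pairs of individual coordinates ($X_i$, $X_j$):

$$\Sigma_{ij} = \mathrm{Cov}(X_i, X_j) = \mathbb{E}\left[(X_i - \mu_i)(X_j - \mu_j)\right]$$

The cumulative distribution function of $X$ looks like this:

$$F_X(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_X(t_1, \dots, t_n) \, dt_n \cdots dt_1$$

and the probability density function (for a non-singular $\Sigma$) is:

$$f_X(x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$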
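As a quick sanity check of the density formula, here is a minimal sketch for the 2-variate case in plain Python. The function name `bivariate_normal_pdf` and the evaluation point are my own choices for illustration; the 2x2 inverse is written out in closed form to avoid any dependencies.

```python
import math


def bivariate_normal_pdf(x, mu, sigma):
    """Density of a 2-variate normal N(mu, sigma) at the point x.

    x and mu are sequences of length 2, sigma is a 2x2 covariance
    matrix given as nested lists.
    """
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) from the exponent.
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    # n = 2, so the normalizer is sqrt((2 * pi)^2 * det(Sigma)).
    return math.exp(-0.5 * q) / math.sqrt((2 * math.pi) ** 2 * det)


# For the standard 2-variate normal the peak density is 1 / (2 * pi).
peak = bivariate_normal_pdf([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(peak)
```

For the identity covariance the formula collapses to the product of two standard one-dimensional densities, which is why the peak equals $1 / (2\pi)$.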
Mahalanobis distance and covariance matrix
What is the meaning of the covariance matrix, and what does it do in the probability density function of the multivariate normal?
The square root of the quadratic form $(x - \mu)^T \Sigma^{-1} (x - \mu)$, where $x$ and $\mu$ are n-vectors and $\Sigma$ is an n x n matrix, is
called the Mahalanobis distance between the vectors $x$ and $\mu$.
If $\Sigma$ is the identity matrix, i.e. $\Sigma = I$, the Mahalanobis distance is the same as the Euclidean distance.
However, if the coordinates of the vector X are strongly correlated, Mahalanobis distance could be much more helpful to e.g. detect outliers.
For instance, imagine that your vector $X = (X_1, X_2, X_3)$ contains the properties of a flat: ($X_1$ = total_area, $X_2$ = living_rooms_area, $X_3$ = distance from center).
You can tell that total flat area and living rooms area have a reasonably strong correlation (as an edge case they could completely duplicate each other).
For instance, here is a possible covariance matrix for your flat’s properties , .
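To make the outlier argument concrete, here is a minimal sketch in plain Python, restricted to the first two properties (total_area, living_rooms_area). The mean vector, the covariance values and the two flats are hypothetical numbers that I picked for illustration (correlation 0.9 between the two areas); they are not taken from any real dataset.

```python
import math


def mahalanobis_2d(x, mu, sigma):
    """Mahalanobis distance between 2-vectors x and mu for a 2x2 covariance."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(q)


# Hypothetical numbers: (total_area, living_rooms_area) in square meters.
mu = [60.0, 30.0]
sigma = [[400.0, 180.0], [180.0, 100.0]]  # std 20 and 10, correlation 0.9

typical_flat = [80.0, 40.0]  # bigger flat with bigger living rooms
odd_flat = [80.0, 20.0]      # bigger flat with smaller living rooms

# Both flats are equally far from the mean in the Euclidean sense...
euclid_typical = math.hypot(typical_flat[0] - mu[0], typical_flat[1] - mu[1])
euclid_odd = math.hypot(odd_flat[0] - mu[0], odd_flat[1] - mu[1])

# ...but the Mahalanobis distance singles out the flat that contradicts
# the correlation between the two areas.
maha_typical = mahalanobis_2d(typical_flat, mu, sigma)  # about 1.03
maha_odd = mahalanobis_2d(odd_flat, mu, sigma)          # about 4.47
print(euclid_typical, euclid_odd, maha_typical, maha_odd)
```

The second flat is roughly four times farther from the mean in Mahalanobis terms, although the Euclidean distances coincide.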
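The key to understanding the covariance matrix is the analysis of its eigendecomposition. Let $E$ be the matrix whose columns are the unit-length eigenvectors of $\Sigma$, and let $\Lambda$ be the diagonal matrix of eigenvalues of $\Sigma$, so that $\Sigma = E \Lambda E^{-1}$.
The covariance matrix is symmetric (and positive semi-definite). For a symmetric matrix, the eigenvectors are orthogonal (and the inverse of an orthogonal matrix is its transpose): $E^{-1} = E^T$ => $\Sigma = E \Lambda E^T$.
Indeed, $\Sigma = \Sigma^T$, thus $E \Lambda E^{-1} = (E \Lambda E^{-1})^T = (E^{-1})^T \Lambda E^T$, indicating that $E^{-1} = E^T$, or $E$ is orthogonal.
So the logic of the Mahalanobis distance can be seen as follows: $(x - \mu)^T \Sigma^{-1} (x - \mu) = (x - \mu)^T E \Lambda^{-1} E^T (x - \mu) = \left(E^T (x - \mu)\right)^T \Lambda^{-1} \left(E^T (x - \mu)\right)$.
By multiplying $(x - \mu)$ by the inverse/transposed eigenvector matrix $E^T$ (and doing the same in a transposed way on the left side of $\Lambda^{-1}$, when multiplying $(x - \mu)^T E$), we de-correlate the dimensions of the vector, transforming the inter-dependent factors into orthogonal, uncorrelated ones.
Then we take the sum of squares of those de-correlated factors, but a weighted one: we give some dimensions more weight than others by multiplying by the inverse matrix of eigenvalues $\Lambda^{-1}$, i.e. each squared coordinate is divided by the variance along its eigenvector.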
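Let us show that the correlated coordinates of $X$, multiplied by the transposed eigenvector matrix $E^T$, become uncorrelated. Let $Y = E^T X$. If the eigenvector $e_i$ (the i-th column of $E$) has coordinates $(e_{i1}, \dots, e_{in})$, then:

$$Y_i = e_i^T X = \sum_{k=1}^{n} e_{ik} X_k$$

Let’s now calculate the covariance between two coordinates of $Y$, e.g. $Y_1$ and $Y_2$:

$$\mathrm{Cov}(Y_1, Y_2) = \mathrm{Cov}\left(\sum_{k} e_{1k} X_k, \sum_{l} e_{2l} X_l\right) = \sum_{k} \sum_{l} e_{1k} e_{2l} \mathrm{Cov}(X_k, X_l) = e_1^T \Sigma e_2 = \lambda_2 e_1^T e_2 = 0$$

First, we used the fact that the covariance of linear combinations of random variables is a linear combination of covariances. Then we used the fact that $e_2$ is an eigenvector of the matrix $\Sigma$, and $\Sigma e_2 = \lambda_2 e_2$. Lastly, we used the fact that the eigenvectors $e_1$ and $e_2$ are orthogonal, and their dot product is 0.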
Now, as you can see, the exponent in the p.d.f. of the multivariate normal distribution is the square of the Mahalanobis distance between the vector $x$ and its mean $\mu$, divided by 2 and taken with a minus sign.
So it works in the same way: it converts our correlated factors into uncorrelated ones and takes the sum of their squares, each divided by the eigenvalue of the respective direction.
This also explains why the denominator contains $\det \Sigma$: the eigenvalues of the covariance matrix are the elements of the diagonal matrix $\Lambda$ in the eigendecomposition $\Sigma = E \Lambda E^T$, and they are the variances of the de-correlated normal distributions.
By the Binet-Cauchy formula, the determinant of $\Sigma$ is $\det \Sigma = \det E \cdot \det \Lambda \cdot \det E^T = \det \Lambda = \prod_{i=1}^{n} \lambda_i$. Thus, by normalizing the probability density function by $\sqrt{(2\pi)^n \det \Sigma}$,
we do the same as by normalizing the p.d.f. of a one-dimensional normal distribution by $\sqrt{2\pi\sigma^2}$.
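The de-correlation argument and the determinant identity can be checked numerically. Below is a minimal sketch in plain Python for a 2x2 covariance matrix; the matrix values are hypothetical, and the helper names `eigen_sym_2x2` and `bilinear` are my own. It uses the closed-form eigendecomposition of a symmetric 2x2 matrix (a rotation by the principal-axis angle) instead of a linear algebra library.

```python
import math


def eigen_sym_2x2(sigma):
    """Eigenvalues and orthonormal eigenvectors of a symmetric 2x2 matrix.

    Returns (lam1, lam2, e1, e2) with lam1 >= lam2.
    """
    a, b, c = sigma[0][0], sigma[0][1], sigma[1][1]
    # Angle of the principal axis of [[a, b], [b, c]].
    theta = 0.5 * math.atan2(2 * b, a - c)
    e1 = (math.cos(theta), math.sin(theta))
    e2 = (-math.sin(theta), math.cos(theta))
    center = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return center + radius, center - radius, e1, e2


def bilinear(u, sigma, v):
    """u^T Sigma v for 2-vectors u and v."""
    return sum(u[i] * sigma[i][j] * v[j] for i in range(2) for j in range(2))


sigma = [[400.0, 180.0], [180.0, 100.0]]  # hypothetical covariance matrix
lam1, lam2, e1, e2 = eigen_sym_2x2(sigma)

# Covariance matrix of Y = E^T X is E^T Sigma E, computed entry by entry.
cov_y1_y2 = bilinear(e1, sigma, e2)  # off-diagonal entry, expected to vanish
var_y1 = bilinear(e1, sigma, e1)     # expected to equal lam1
var_y2 = bilinear(e2, sigma, e2)     # expected to equal lam2

det_sigma = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
print(cov_y1_y2, var_y1 - lam1, var_y2 - lam2, det_sigma - lam1 * lam2)
```

All four printed numbers should be zero up to floating-point rounding: the rotated coordinates are uncorrelated, their variances are the eigenvalues, and the determinant is the product of the eigenvalues.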
Uncorrelated multidimensional normal variables are independent
This property of multidimensional normal distribution is fairly obvious from the previous property.
For instance, suppose that your covariance matrix is as in the following example from StackOverflow:
Now, if we substitute this into the probability density function, we get:
Thus, we can see that the density of a multivariate normal vector with uncorrelated coordinates factors into a product of one-dimensional densities, so the coordinates are independent random variables.
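The factorization is easy to verify numerically. The sketch below is my own illustration with hypothetical covariance values (not the matrix from the example above): with a diagonal covariance the joint 2-variate density equals the product of the two one-dimensional densities, and with a non-zero covariance it does not.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional normal N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def bivariate_normal_pdf(x, mu, sigma):
    """Density of a 2-variate normal N(mu, sigma) at x (sigma is 2x2)."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * q) / math.sqrt((2 * math.pi) ** 2 * det)


x = [1.0, -2.0]
mu = [0.0, 0.0]

# Product of the two marginal densities N(0, 4) and N(0, 9).
product_of_marginals = normal_pdf(x[0], 0.0, 4.0) * normal_pdf(x[1], 0.0, 9.0)

# Uncorrelated case: the joint density factors exactly.
joint_uncorrelated = bivariate_normal_pdf(x, mu, [[4.0, 0.0], [0.0, 9.0]])

# Correlated case (covariance 3): same marginals, but no factorization.
joint_correlated = bivariate_normal_pdf(x, mu, [[4.0, 3.0], [3.0, 9.0]])

print(product_of_marginals, joint_uncorrelated, joint_correlated)
```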
Marginalization and Conditioning
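You marginalize a multivariate normal distribution by integrating its density over one (or more) of its dimensions.
For instance, if you integrate Galton’s 2-variate normal distribution over the heights of all the fathers, you get the 1-dimensional distribution of the heights of all the children.
You do conditioning when you fix the value of one dimension of a multivariate normal distribution and obtain a lower-dimensional one.
For instance, you can choose the fathers who are $x_1$ inches tall, and obtain the conditional distribution of
the heights of their children, which is one-dimensional normal:

$$X_2 \mid X_1 = x_1 \sim \mathcal{N}\left(\mu_2 + \rho \frac{\sigma_2}{\sigma_1} (x_1 - \mu_1), \; (1 - \rho^2) \sigma_2^2\right)$$

Here $X_1$ and $X_2$ are the heights of the father and the child, $\mu_1, \sigma_1$ and $\mu_2, \sigma_2$ are their means and standard deviations, and $\rho$ is their correlation coefficient.
Note that the mean and variance of this distribution differ from those of the marginal one: children of taller fathers are, on average, taller, and the conditional variance is smaller than the marginal variance.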
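Here is a minimal sketch of conditioning for the 2-variate case in plain Python. The means, variances and the correlation of 0.5 are hypothetical round numbers of mine, not Galton's actual estimates; the function name `condition_bivariate` is also my own. It uses the standard formulas for the conditional mean and variance of a bivariate normal, written in terms of the covariance matrix entries.

```python
def condition_bivariate(mu, sigma, x1):
    """Mean and variance of X2 given X1 = x1 for a 2-variate normal.

    mu is the mean vector (mu1, mu2), sigma is the 2x2 covariance matrix.
    """
    s11, s12, s22 = sigma[0][0], sigma[0][1], sigma[1][1]
    # s12 / s11 equals rho * sigma2 / sigma1, the regression slope.
    cond_mean = mu[1] + s12 / s11 * (x1 - mu[0])
    # s22 - s12^2 / s11 equals (1 - rho^2) * sigma2^2.
    cond_var = s22 - s12 * s12 / s11
    return cond_mean, cond_var


# Hypothetical numbers: heights in inches, std 2.5 for both, correlation 0.5.
mu = [69.0, 68.0]                        # (fathers, children)
sigma = [[6.25, 3.125], [3.125, 6.25]]

# Marginal distribution of children's heights: just drop the first coordinate.
marginal_mean, marginal_var = mu[1], sigma[1][1]

# Conditional distribution for children of 72-inch-tall fathers.
cond_mean, cond_var = condition_bivariate(mu, sigma, 72.0)
print(marginal_mean, marginal_var, cond_mean, cond_var)
```

With these numbers the conditional mean moves from 68 to 69.5 inches, only half of the father's 3-inch excess over the mean, which is the "regression towards mediocrity" from the title of Galton's paper, while the variance shrinks from 6.25 to 4.6875.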
Quadratic forms, their ranks and special cases of quadratic forms
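The exponent of the p.d.f. of a multivariate normal is a quadratic form, i.e. an expression of the form $x^T A x$.
Speaking of the matrix $A$ of a quadratic form, there is a useful concept of matrix rank, which is the number of linearly independent rows/columns in the matrix.
For instance, if our quadratic form is just a product of two dot products $(x^T a)(b^T x)$, where $a$ and $b$ are vectors of coefficients, the rank of its matrix equals 1, because all the rows are linearly dependent.
Indeed, we can see this when we write the product in matrix notation:

$$(x^T a)(b^T x) = \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix} \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \begin{pmatrix} b_1 & \cdots & b_n \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$

The multiplication of a row-vector by a column-vector in linear algebra is called the dot product, or inner product:

$$b^T x = \sum_{i=1}^{n} b_i x_i$$

Less commonly used, the multiplication of a column-vector by a row-vector is called the outer product, and it results in a matrix where each element is a product of the respective elements of the column-vector and the row-vector:

$$a b^T = \begin{pmatrix} a_1 b_1 & \cdots & a_1 b_n \\ \vdots & \ddots & \vdots \\ a_n b_1 & \cdots & a_n b_n \end{pmatrix}$$

You can choose an arbitrary order of application of the outer-product and inner-product operations:

$$(x^T a)(b^T x) = x^T (a b^T) x$$

If you choose the latter way, it becomes obvious that the rank of the matrix
$A = a b^T$, formed by the outer product of the coefficients $a$ and $b$, equals 1.
Indeed, its i-th row is the row-vector $b^T$ multiplied by the scalar $a_i$, so
all the rows differ just by the scalar factor $a_i$, so there is just 1 linearly independent row.
If we take a real-life 2 x 2 covariance matrix $\Sigma$ with a non-zero determinant, its rank equals 2, so it cannot be represented as an outer product $a b^T$.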
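The rank-1 claim can be sketched in a few lines of plain Python. The vectors and the 2 x 2 covariance matrix below are hypothetical integers of mine, chosen so that all comparisons are exact; the helper `has_rank_at_most_one` is my own shortcut that checks that every 2x2 minor vanishes, which is equivalent to all rows being proportional.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def outer(a, b):
    """Outer product a b^T as a nested list; row i is a[i] times b."""
    return [[ai * bj for bj in b] for ai in a]


def quad_form(x, m):
    """Quadratic form x^T M x."""
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))


def has_rank_at_most_one(m):
    """True if every 2x2 minor of m is zero, i.e. all rows are proportional."""
    n, k = len(m), len(m[0])
    return all(
        m[i][p] * m[j][q] == m[i][q] * m[j][p]
        for i in range(n) for j in range(i + 1, n)
        for p in range(k) for q in range(p + 1, k)
    )


a, b, x = [1, 2, 3], [4, 5, 6], [2, -1, 1]
A = outer(a, b)

# Both orders of applying inner and outer products give the same number.
inner_first = dot(x, a) * dot(b, x)   # (x^T a)(b^T x)
outer_first = quad_form(x, A)         # x^T (a b^T) x

rank_one = has_rank_at_most_one(A)                         # outer product
cov_rank_one = has_rank_at_most_one([[4, 3], [3, 9]])      # determinant 27
print(inner_first, outer_first, rank_one, cov_rank_one)
```

The outer product passes the rank-1 check, while the covariance matrix with a non-zero determinant does not.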