
Pearson's Chi-square test - intuition and derivation

June 17, 2021 23 min read

Here I discuss how an average mathematically inclined person like myself could stumble upon Karl Pearson's chi-squared test (it doesn't seem intuitive at all at first glance). I demonstrate the intuition behind it and then prove its applicability to the multinomial distribution.

In one of my previous posts I derived the chi-square distribution for the sum of squares of Gaussian random variables and showed that it is a special case of the Gamma distribution and very similar to the Erlang distribution. You can look it up for reference.

A motivational example

Suppose you've taken 5 samples of a random variable that you assumed to be standard normal and got suspicious results: you have a feeling that your points have landed too far away from the peak of the distribution.

[Figure: Gaussian distribution, observed dispersion]

Five points sampled from the standard normal distribution. In this example the sampled points are very far from the distribution's mode, which is very, very unlikely.

Given the probability density function (abbreviated pdf) of a Gaussian distribution and the knowledge that the standard deviation is 1, you would expect a much tighter clustering of your samples, something like this:

[Figure: Gaussian distribution, expected dispersion]

Another sample of 5 points from the standard normal distribution. This result is much more typical.

You could ask yourself a question: what is the average value of the probability density function that I would observe when sampling a point from the standard normal distribution?

Well, you can actually calculate it, following the expectation formula:

$$\mathbb{E}_\xi[pdf] = \int \limits_{-\infty}^{+\infty} \underbrace{ \frac{1}{\sqrt{2 \pi}} e^{-\frac{x^2}{2}} }_\text{This is the pdf, i.e. the random variable you're averaging} \cdot \underbrace{ \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}}_{\text{This is the pdf of } \xi \text{ over which you're averaging}} \, dx = \frac{1}{ 2\pi } \int \limits_{-\infty}^{+\infty} e^{-x^2} dx.$$

Now, substitute $t = \sqrt{2}x$:

$$\frac{1}{ 2\pi } \int \limits_{-\infty}^{+\infty} e^{-x^2} dx = \frac{1}{ 2\pi } \int \limits_{-\infty}^{+\infty} e^{-\frac{t^2}{2}} \, d\!\left(\frac{t}{\sqrt2}\right) = \frac{1}{\sqrt{2} \cdot 2\pi } \int \limits_{-\infty}^{+\infty} e^{-\frac{t^2}{2}} dt = \frac{\cancel{\sqrt{2\pi}}}{\sqrt{2}\cdot \sqrt{2\pi} \cancel{\sqrt{2\pi}}} = \frac{1}{\sqrt{2} \sqrt{2\pi}}.$$

So, the probability density at an average point you sample is expected to be $\frac{1}{\sqrt{2} \sqrt{2 \pi}} = 0.707106 \cdot 0.398942 = 0.282095$.
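As a quick numerical sanity check (a sketch, not part of the original derivation), we can estimate this average density by Monte Carlo, averaging the standard normal pdf over points sampled from that same distribution:

```python
# Monte Carlo check of E[pdf] for the standard normal (illustrative sketch).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
samples = rng.standard_normal(1_000_000)

print(norm.pdf(samples).mean())                # empirical average density, ~0.2821
print(1 / (np.sqrt(2) * np.sqrt(2 * np.pi)))   # closed-form value, 0.282095...
```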

Let us calculate the pdf at several points:

if x=0 (which is the most probable point), your $f_\xi(0)=\frac{1}{ \sqrt{2\pi} } e^{-0/2} = 0.398942$, a bit more probable than average.

if x=1 (which is one standard deviation), your $f_\xi(1)=\frac{1}{ \sqrt{2\pi} } e^{-1/2} = 0.241970$, a bit less probable than average.

if x=2 (which is 2 standard deviations), your $f_\xi(2)=\frac{1}{ \sqrt{2\pi} } e^{-4/2} = 0.053989$, not much.

if x=3 (which is 3 standard deviations), your $f_\xi(3)=\frac{1}{ \sqrt{2\pi} } e^{-9/2} = 0.004432$, very small.

With these numbers let's return to the 5 points I've observed. Out of those five points, two are at 3 standard deviations, two are at 2 standard deviations and one is at 1 standard deviation. So the probability density of such an observation is $f^2_\xi(3) \cdot f^2_\xi(2) \cdot f_\xi(1) = 0.004432^2 \cdot 0.053989^2 \cdot 0.241970 = 1.38 \cdot 10^{-8}$.

At the same time the expected density of five average points is $0.282095^5 = 0.0017863$.

It seems our observation was a really improbable one: it is less probable than the average by a factor of over 100,000.
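The numbers above are easy to reproduce (a sketch; the five "suspicious" points are taken to sit at roughly 3, 3, 2, 2 and 1 standard deviations):

```python
# Density of the suspicious sample vs. the density of an "average" sample of 5 points.
from scipy.stats import norm

suspicious = [3, 3, 2, 2, 1]                     # distances of the observed points from the mean
observed_density = norm.pdf(suspicious).prod()   # ~1.38e-8
average_density = 0.282095 ** 5                  # ~0.0017863

print(observed_density, average_density)
print(average_density / observed_density)        # the ratio exceeds 100,000
```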

Intuition behind Pearson’s chi-square test

We need a more rigorous tool than simply comparing the probability density of our observed five points with the average.

Let us call our vector of observations $X = (x_1, x_2, x_3, x_4, x_5)^T$. The combined pdf of observing exactly our 5 points is the 5-dimensional standard normal density $\frac{1}{\sqrt{2\pi}^5} e^{-\frac{x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2}{2}}$.

But note that we don't want exactly our point. Any "equiprobable" point will do. For instance, $e^{-\frac{1+1+1+1+1}{2}} = e^{-\frac{0+2+1+1+1}{2}}$, the 5-dimensional points $(1,1,1,1,1)^T$ and $(0,\sqrt{2},1,1,1)^T$ are "equiprobable", and we want to group them into one.

So, we are actually interested in the distribution of the sum $x_1^2+x_2^2+x_3^2+x_4^2+x_5^2$: for identical values of this sum, the likelihoods of the observation vectors $X = (x_1, x_2, x_3, x_4, x_5)^T$ are identical.

Each $x_i \sim \mathcal{N}(0, 1)$ is a standard Gaussian-distributed random variable, so the sum in question is a random variable distributed as chi-square: $\sum \limits_{i}^{N} x_i^2 \sim \chi^2_N$. Please refer to my older post on the Gamma/Erlang/Chi-square distributions for the proof.

That's why the chi-square distribution is the right tool to answer the question we're interested in: is it likely that the observed set of points was sampled from a standard normal distribution?
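Concretely, for the five suspicious points the sum of squares is $9 + 9 + 4 + 4 + 1 = 27$, and we can ask how likely a value at least this large is under $\chi^2_5$ (a sketch, assuming the points sit exactly at 3, 3, 2, 2 and 1 standard deviations):

```python
# Tail probability of the observed sum of squares under chi-square with 5 degrees of freedom.
from scipy.stats import chi2

sum_of_squares = 3**2 + 3**2 + 2**2 + 2**2 + 1**2   # = 27
print(chi2.sf(sum_of_squares, df=5))                # a tiny tail probability
```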

Derivation of Pearson’s goodness of fit test statistic

The chi-square test is widely used to validate the hypothesis that a set of samples was drawn from a given multinomial distribution.

Suppose you've rolled a $k=6$-sided dice $n=120$ times, and you expect it to be fair. You would expect $E_i=20$ occurrences of each value $i \in \{1,2,3,4,5,6\}$ of the dice (row E, expected); instead you see somewhat different outcomes $O_i$ (row O, observed):

i             1    2    3    4    5    6
E (expected)  20   20   20   20   20   20
O (observed)  15   14   22   21   25   23

We need to estimate the likelihood of an outcome like this if the dice were fair. It turns out that a certain statistic based on these data follows the chi-square distribution: $\chi_{k-1}^2 = \sum \limits_{i}^{k} \frac{(O_i-E_i)^2}{E_i}$. I'll prove this fact here by induction on the number of dice sides $k$, loosely following some constructs from the de Moivre-Laplace theorem (which is a special case of the Central Limit Theorem, proven before the adoption of much more convenient Fourier analysis techniques).
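For the dice table above the statistic is easy to compute by hand or with scipy (a sketch; scipy.stats.chisquare performs exactly this goodness-of-fit test):

```python
# Pearson's goodness-of-fit statistic for the 6-sided dice data above.
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([15, 14, 22, 21, 25, 23])
expected = np.full(6, 20)

stat = ((observed - expected) ** 2 / expected).sum()
print(stat)                               # 5.0 for this data
print(chi2.sf(stat, df=5))                # p-value against chi-square with k-1 = 5 d.o.f.
print(chisquare(observed, expected))      # scipy gives the same statistic and p-value
```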

Induction base: 2-sided dice (coin)

$$\sum \limits_{i=1}^{k=2} \frac{(O_i - E_i)^2}{E_i} = \frac{(O_1 - np)^2}{np} + \frac{((n-O_1) - n(1-p))^2}{n(1-p)} = \frac{(O_1-np)^2}{np} + \frac{(O_1-np)^2}{n(1-p)} = \frac{(O_1 -np)^2(1-p) + (O_1 -np)^2p}{np(1-p)} = \frac{(O_1-np)^2}{np(1-p)}$$

Now recall that $\xi_1 = \frac{O_1-np}{\sqrt{np(1-p)}} \sim \mathcal{N}(0, 1)$ (approximately, for large $n$, by the de Moivre-Laplace theorem).

Thus, $\xi_1^2 = \frac{(O_1-np)^2}{np(1-p)} \sim \chi^2_1$.
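A small simulation illustrates the induction base (a sketch with arbitrary $n$ and $p$): for binomial counts, the statistic $(O_1 - np)^2 / (np(1-p))$ behaves like a $\chi^2_1$ random variable.

```python
# Empirical distribution of (O_1 - n p)^2 / (n p (1 - p)) vs. chi-square(1) quantiles.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, p, trials = 1000, 0.3, 100_000
o1 = rng.binomial(n, p, size=trials)
stat = (o1 - n * p) ** 2 / (n * p * (1 - p))

for q in (0.5, 0.9, 0.99):
    print(q, np.quantile(stat, q), chi2.ppf(q, df=1))   # the quantiles should roughly match
```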

Induction step: (k+1)-sided dice from k-sided dice

In order to perform the induction step, we need to show that if we had a k-dice described by a k-nomial distribution and could apply the $\chi^2_{k-1}$ test to it, there is a way to construct a dice with k+1 sides from it, so that it can be described by a (k+1)-nomial distribution and validated by a $\chi^2_k$ test.

Here's an example to illustrate the process of creating an additional side of our dice.

Imagine that we had a continuous random variable representing the distribution of human heights. We transformed this real-valued random variable into a discrete one by dividing all people into one of two bins, "height < 170cm" and "height >= 170cm", with probabilities $p_1$ and $p_2$, so that $p_1 + p_2 = 1$. The chi-square test works for our data according to the induction base. We can see sampling a value from this r.v. as a 2-dice roll (or a coin flip).

Now we decided to split the second bin ("height >= 170cm") into two separate bins: "170cm <= height < 180cm" and "height >= 180cm". So what used to be one side of our 2-dice has become 2 sides, and now we have a 3-dice. Let's show that the chi-square test will just gain another degree of freedom (it's going to be $\chi_2^2$ instead of $\chi_1^2$ now), but will still work for our new 3-dice.
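Before proving it, here is a simulation sketch previewing what we are about to show (the bin probabilities below are arbitrary illustration values): the 3-bin Pearson statistic should indeed follow $\chi^2_2$.

```python
# Pearson statistic for a 3-bin multinomial, compared against chi-square(2).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, probs = 500, np.array([0.5, 0.3, 0.2])
counts = rng.multinomial(n, probs, size=100_000)
stat = ((counts - n * probs) ** 2 / (n * probs)).sum(axis=1)

for q in (0.5, 0.9, 0.99):
    print(q, np.quantile(stat, q), chi2.ppf(q, df=2))   # quantiles roughly agree
```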

Statement 1. Pearson's test statistic is a sum of a chi-square-distributed first term and a simple second term

Let's write down the formula of Pearson's test for the 3-nomial distribution and split it into a binomial part and an additional term:

$$\sum \limits_{i=1}^{3} \frac{(O_i - np_i)^2}{np_i} = \frac{(O_1 - np_1)^2}{np_1} + \frac{(O_2 - np_2)^2}{np_2} + \frac{(O_3 - np_3)^2}{np_3} = \underbrace{\frac{(O_1 - np_1)^2}{np_1} + \frac{(O_2 + O_3 - n(p_2 + p_3))^2}{n(p_2 + p_3)}}_{\sum \limits_{j=1}^2 \frac{(O_j'-np_j')^2}{np_j'} \sim \chi_1^2 \text{ for the sum of } k=2 \text{ terms by the induction base}} + \underbrace{\left( -\frac{(O_2 + O_3 - n(p_2 + p_3))^2}{n(p_2 + p_3)} + \frac{(O_2 - np_2)^2}{np_2} + \frac{(O_3 - np_3)^2}{np_3} \right)}_{\text{this part should also be } \sim \chi_1^2 \text{, let's prove this}}$$

Let us focus on the second term and simplify it:

$$-\frac{(O_2 + O_3 - n(p_2 + p_3))^2}{n(p_2 + p_3)} + \frac{(O_2 - np_2)^2}{np_2} + \frac{(O_3 - np_3)^2}{np_3} = -\frac{(O_2 + O_3 - n(p_2 + p_3))^2\cdot p_2 p_3}{n(p_2 + p_3) \cdot p_2p_3} + \frac{(O_2 - np_2)^2 \cdot (p_2 + p_3)p_3}{np_2 \cdot (p_2 + p_3)p_3} + \frac{(O_3 - np_3)^2 \cdot (p_2 + p_3)p_2}{np_3 \cdot (p_2 + p_3)p_2} =$$

$$= \frac{ (O_2^2 - 2O_2np_2 + n^2p_2^2)(p_2p_3 + p_3^2) + (O_3^2 - 2O_3np_3 + n^2p_3^2)(p_2^2 + p_2p_3) - ((O_2+O_3)^2 - 2(O_2+O_3)n(p_2+p_3) + n^2(p_2^2 + 2p_2p_3 + p_3^2))p_2p_3 }{np_2p_3(p_2+p_3)} =$$

$$= \frac{(\cancel{O_2^2p_2p_3} + O_2^2p_3^2 - \cancel{2O_2np_2^2p_3} - \cancel{2O_2np_2p_3^2} + \cancel{n^2p_2^3p_3} + \cancel{n^2p_2^2p_3^2}) + (O_3^2p_2^2 + \cancel{O_3^2p_2p_3} - \cancel{2O_3np_3p_2^2} - \cancel{2O_3np_2p_3^2} + \cancel{n^2p_2^2p_3^2} + \cancel{n^2p_2p_3^3} ) - (\cancel{O_2^2p_2p_3} + 2O_2O_3p_2p_3 + \cancel{O_3^2p_2p_3} - \cancel{2O_2np_2^2p_3} - \cancel{2O_2np_2p_3^2} - \cancel{2O_3np_2^2p_3} - \cancel{2O_3np_2p_3^2} + \cancel{n^2p_2^3p_3} + \cancel{2n^2p_2^2p_3^2} + \cancel{n^2p_2p_3^3}) }{np_2p_3(p_2+p_3)} =$$

$$= \frac{O_2^2p_3^2 + O_3^2p_2^2 - 2O_2O_3p_2p_3}{np_2p_3(p_2+p_3)} = \frac{(O_2p_3 - O_3p_2)^2}{np_2p_3(p_2+p_3)}.$$
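This identity is easy to verify numerically (a sketch with arbitrary values of $O_2$, $O_3$, $n$, $p_2$, $p_3$):

```python
# Numerical check of the simplification of the second term.
n, p2, p3 = 120, 0.25, 0.4
o2, o3 = 35, 44

lhs = (-(o2 + o3 - n * (p2 + p3)) ** 2 / (n * (p2 + p3))
       + (o2 - n * p2) ** 2 / (n * p2)
       + (o3 - n * p3) ** 2 / (n * p3))
rhs = (o2 * p3 - o3 * p2) ** 2 / (n * p2 * p3 * (p2 + p3))

print(lhs, rhs)    # the two sides agree
```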

Statement 2. The second term is the square of a standard normal random variable

The random variable that we've obtained has a $\chi_1^2$ distribution because it is the square of the random variable $\xi = \frac{O_2p_3 - O_3p_2}{\sqrt{np_2p_3(p_2+p_3)}}$, which is a standard normal one.

Let's show this fact: indeed, $O_2$ and $O_3$ are approximately Gaussian r.v.s (by de Moivre-Laplace/C.L.T.) with expectations $\mathbb{E}[O_2] = np_2$ and $\mathbb{E}[O_3] = np_3$ and variances $Var[O_2] = np_2(1-p_2)$ and $Var[O_3] = np_3(1-p_3)$, respectively.

A sum of 2 Gaussian random variables is Gaussian, with expectation equal to the sum of expectations and variance equal to the sum of variances plus twice the covariance: $\sigma_{X+Y} = \sqrt{\sigma_{X}^2 + \sigma_{Y}^2 + 2 \rho \sigma_{X} \sigma_{Y}}$. This fact can be proved using either convolutions or the Fourier transform (traditionally known as characteristic functions in probability theory).

$$\mathbb{E}[\xi] = \frac{np_2p_3 - np_3p_2}{\sqrt{n p_2 p_3 (p_2 + p_3)}} = 0$$

Now, let us calculate the variance of $\xi$. Our random variables $O_2$ and $O_3$ are not independent, so recall that the variance of their difference is the sum of the individual variances minus twice the covariance:

$$Var[\xi] = Var\left[\frac{p_3 O_2 - p_2 O_3}{\sqrt{n p_2 p_3 (p_2 + p_3)}}\right] = \frac{1}{n p_2 p_3 (p_2 + p_3)} Var[p_3 O_2 - p_2 O_3] = \frac{1}{n p_2 p_3 (p_2 + p_3)} \left(Var[p_3 O_2] + Var[p_2 O_3] - 2 Cov[p_3 O_2, p_2 O_3]\right) =$$

$$= \frac{1}{n p_2 p_3 (p_2 + p_3)} \left(Var[O_2] \cdot p_3^2 + Var[O_3] \cdot p_2^2 - 2 p_2 p_3 \cdot Cov[O_2, O_3]\right) = \frac{np_2(1-p_2)p_3^2 + np_3(1-p_3)p_2^2 - 2 p_2 p_3 \cdot Cov[O_2, O_3]}{n p_2 p_3 (p_2+p_3)} = \frac{n p_2 p_3 (p_2 + p_3) - 2 n p_2^2 p_3^2 - 2 p_2 p_3 \cdot Cov[O_2, O_3]}{n p_2 p_3 (p_2+p_3)}.$$

What is the covariance $Cov[O_2, O_3]$? It is the well-known covariance of multinomial random variables: $Cov[O_2, O_3] = -n p_2 p_3$.

Hence, $Var[\xi] = \frac{n p_2 p_3 (p_2 + p_3) - 2 n p_2^2 p_3^2 - 2 p_2 p_3 \cdot Cov[O_2, O_3]}{n p_2 p_3 (p_2+p_3)} = \frac{n p_2 p_3 (p_2 + p_3) - \cancel{2 n p_2^2 p_3^2} - \cancel{2 p_2 p_3 \cdot (-n p_2 p_3)} }{n p_2 p_3 (p_2+p_3)} = 1$

Thus, we've shown that our normal random variable $\xi$ has zero expectation and unit variance: $\xi \sim \mathcal{N}(0, 1)$. Hence, $\xi^2 \sim \chi_1^2$.
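A simulation sketch confirms both the multinomial covariance used above and the standardization of $\xi$ (the probabilities below are arbitrary illustration values):

```python
# Check Cov[O_2, O_3] = -n p_2 p_3 and that xi has mean ~0 and variance ~1.
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, p3 = 300, 0.35, 0.25, 0.4
counts = rng.multinomial(n, [p1, p2, p3], size=200_000)
o2, o3 = counts[:, 1], counts[:, 2]

print(np.cov(o2, o3)[0, 1], -n * p2 * p3)         # empirical vs. theoretical covariance
xi = (o2 * p3 - o3 * p2) / np.sqrt(n * p2 * p3 * (p2 + p3))
print(xi.mean(), xi.var())                        # close to 0 and 1
```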

Statement 3. Independence of the second term and the first term

Finally, we need to show the independence of the second term $\xi^2 \sim \chi_1^2$ and the first term $\sum \limits_{j=1}^{k} \frac{(O_j'-np_j')^2}{np_j'} \sim \chi_{k-1}^2$, so that their sum is distributed as $\chi_{k}^2$.

Let me remind you of the notation: here $O_j' = O_j$ for $j=1..k-1$ and $O_j' = O_j + O_{j+1}$ for $j=k$ (and likewise $p_j' = p_j$ for $j=1..k-1$, $p_k' = p_k + p_{k+1}$), so that the final component of the sum in the first term is $\frac{(O_k + O_{k+1} - n(p_k + p_{k+1}))^2}{n(p_k+p_{k+1})}$.

For every $j < k$ (in the case of the first induction step, $k=2$, there is only one such index, $j=1$) consider $O_j$ and the numerator $(O_2p_3 - O_3p_2)$ of our $\xi$:

Using the multinomial covariance $Cov[O_j, O_i] = -np_jp_i$ for $i \neq j$:

$$Cov[O_j, O_2p_3 - O_3p_2] = p_3 \cdot Cov[O_j, O_2] - p_2 \cdot Cov[O_j, O_3] = p_3 \cdot (-n p_j p_2) - p_2 \cdot (-n p_j p_3) = -n p_j p_2 p_3 + n p_j p_3 p_2 = 0,$$

so that the first $k-1$ components $\frac{(O_j - np_j)^2}{np_j}$ of the $\chi_{k-1}^2$-distributed first term are independent of our $\xi^2 = \frac{(O_2p_3 - O_3p_2)^2}{np_2p_3(p_2+p_3)}$.

What's left is to show the independence of the final component $\frac{(O_k + O_{k+1} - n(p_k + p_{k+1}))^2}{n(p_k+p_{k+1})}$ of the first term from our $\xi^2 = \frac{(O_2p_3 - O_3p_2)^2}{np_2p_3(p_2+p_3)}$ (or just from its numerator):

$$Cov[O_{k+1} + O_k, O_2p_3 - O_3p_2] = Cov[O_3 + O_2, O_2p_3 - O_3p_2] = Cov[O_3, O_2p_3 - O_3p_2] + Cov[O_2, O_2p_3 - O_3p_2] = n (-p_3p_2p_3 - p_3(1-p_3)p_2 + p_2(1-p_2)p_3 + p_2 p_3 p_2) = n (-{p_3}^2p_2 - p_3 p_2 + p_3^2 p_2 + p_2 p_3 - {p_2}^2p_3 + p_2^2 p_3) = 0.$$

Since all of the underlying quantities are (asymptotically) jointly Gaussian, zero covariance implies independence; therefore the first term and $\xi^2$ are independent, and their sum is distributed as $\chi_k^2$. This concludes the proof.
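As a final sanity check (a sketch with arbitrary probabilities), the numerator of $\xi$ is indeed uncorrelated both with $O_1$ and with $O_2 + O_3$:

```python
# Empirical check of the two zero-covariance computations from Statement 3.
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, p3 = 300, 0.35, 0.25, 0.4
counts = rng.multinomial(n, [p1, p2, p3], size=200_000)
o1, o2, o3 = counts[:, 0], counts[:, 1], counts[:, 2]
numerator = o2 * p3 - o3 * p2

print(np.cov(o1, numerator)[0, 1])         # ~0
print(np.cov(o2 + o3, numerator)[0, 1])    # ~0
```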


The derivation of Pearson's goodness of fit test statistic follows the logic of proof number 6 from this paper about 7 ways to prove Pearson's test.

Many thanks to Ming Zhang, Justin Cen and RoyalDiscusser for finding errors in this post and suggesting improvements.


Boris Burkov

Written by Boris Burkov, who lives in Moscow, Russia, loves to take part in the development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me on Telegram.