
Student's t-distribution, t-test

June 20, 2021 15 min read

Here I discuss how to derive Student's t-distribution, an important statistical distribution that serves as the basis for the t-test.

Student’s t-distribution is yet another important distribution, closely related to the chi-squared distribution. For the derivation of the chi-squared distribution, see one of my previous posts.

William Sealy Gosset, known as “Student”, used to work at the Guinness brewery and was interested in working with small datasets. He was not allowed to publish his findings under his real name, but was allowed to publish the results of his research under a pseudonym. Having attended Karl Pearson’s courses, he chose the pen name “Student”.

Student’s t-distribution derivation

Suppose that you have sampled $n$ times from a normally distributed random variable $\xi \sim \mathcal{N}(\mu, \sigma^2)$, but you don’t know the mean and variance of that variable.

The best you can do is to substitute the unknown mean and variance with their unbiased sample estimates: the sample mean $\bar{\xi} = \frac{\sum \limits_{i=1}^n\xi_i}{n}$ and the sample variance $S^2 = \frac{1}{n-1} \sum \limits_{i=1}^{n} (\xi_i - \bar{\xi})^2$.

It is intuitive to substitute the sample mean and sample variance into the formula of the normal distribution instead of the true ones. It turns out that if we consider the very similar random variable $T = \frac{\bar{\xi} - \mu }{\sqrt{ \frac{S^2}{n} }}$ (the deviation of the sample mean from the true mean, normalized by the square root of the sample variance over $n$), it is said to be t-Student distributed.
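To make this concrete, here is a minimal simulation sketch (assuming `numpy` and `scipy` are available; the values of $\mu$, $\sigma$ and $n$ are hypothetical): we draw many samples of size $n$, compute $T$ for each, and compare the resulting quantiles with those of Student's t-distribution with $n-1$ degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, trials = 5.0, 2.0, 10, 100_000  # hypothetical parameters

# draw `trials` independent samples of size n from N(mu, sigma^2)
samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)  # unbiased sample variance S^2

T = (xbar - mu) / np.sqrt(s2 / n)

# empirical quantiles of T should match Student's t with n-1 degrees of freedom
for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(T, q), stats.t.ppf(q, df=n - 1))
```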

You may notice that the random variable $T^2 = \frac{ (\bar{\xi} - \mu)^2 }{S^2/n} = \frac{ \frac{ (\bar{\xi} - \mu)^2 }{ \frac{\sigma^2}{n} } }{ \frac{S^2}{\sigma^2} }$ looks very much like a ratio of two chi-squared-distributed random variables (each divided by its number of degrees of freedom). Therefore, $T^2$ would be a Fisher-Snedecor F-distributed random variable, if we managed to prove that the numerator and denominator are independent and that the denominator is chi-squared-distributed.

Let us look deeper into the properties of both of these estimators, as important facts arise from their analysis.

Sample mean and its distribution

The sample mean has a normal distribution $\bar{\xi} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$. Let us show this fact:

Recall that by the rule of summation of normally distributed random variables, the sum of $n$ independent normally distributed random variables (not necessarily infinitely many!) is exactly normally distributed with mean $n\mu$ and variance $n\sigma^2$: $\sum \limits_{i=1}^{n} \xi_i \sim \mathcal{N}(n\mu, n\sigma^2)$.

Therefore, the sample mean $\bar\xi = \frac{\sum \limits_{i=1}^{n} \xi_i}{n} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$, because if $Var(\xi) = \sigma^2$, then $Var(\frac{\xi}{n}) = \frac{\sigma^2}{n^2}$.

Hence, $\frac{ \bar{\xi} - \mu }{ \frac{\sigma}{\sqrt{n}} } \sim \mathcal{N}(0, 1)$ is a standard normal random variable, and its square $\frac{ (\bar{\xi} - \mu)^2 }{ \frac{\sigma^2}{n} } \sim \chi_1^2$ is a chi-squared-distributed variable with 1 degree of freedom.
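A quick numerical sanity check of this fact (again a sketch, with hypothetical parameter values): the standardized sample mean should be consistent with the standard normal distribution, e.g. under a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n = 5.0, 2.0, 10  # hypothetical parameters

# 100_000 sample means of samples of size n
xbar = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

# standardize: should be N(0, 1)
z = (xbar - mu) / (sigma / np.sqrt(n))
print(z.mean(), z.std())               # close to 0 and 1
print(stats.kstest(z, "norm").pvalue)  # p-value not small: consistent with N(0, 1)
```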

Sample variance, unbiased (Bessel) estimator and its expectation

The best estimate of the variance of a random variable that we can get from our experiment is the sample variance $S^2 = \frac{1}{n-1} \sum \limits_{i=1}^{n} (\xi_i - \bar{\xi})^2$.

Note that $S^2$ is normalized by $n-1$, not $n$, which is not intuitive; this is called Bessel’s correction. The reason is that the naive sample variance systematically underestimates the exact variance (also called the distribution variance), and the effect is most pronounced for small samples. To show this fact, let us follow the logic of this post from StackOverflow, which is very similar to the derivation of the bias-variance tradeoff in machine learning books, e.g. Hastie-Tibshirani.

Denote $\hat{\mu}$ the sample mean and $\mu$ the true (distribution) mean.

Let us denote the naive (biased) sample variance $\hat{S}^2 = \frac{1}{n} \sum \limits_{i=1}^{n} (\xi_i-\hat{\mu})^2$ and the unbiased sample variance $S^2 = \frac{1}{n-1} \sum \limits_{i=1}^{n} (\xi_i-\hat{\mu})^2$.

Then the expectation of the naive sample variance is:

$$\mathbb{E}\hat{S}^2 = \mathbb{E}\left(\frac{1}{n}\sum \limits_{i=1}^{n} (\xi_i-\hat{\mu})^2\right) = \mathbb{E}\left(\frac{1}{n}\sum \limits_{i=1}^{n} (\xi_i - \mu + \mu -\hat{\mu})^2\right) = \mathbb{E}\left(\frac{1}{n}\sum \limits_{i=1}^{n} \left( (\xi_i-\mu)^2 + 2(\xi_i - \mu)(\mu - \hat{\mu}) + (\hat{\mu} - \mu)^2\right) \right) =$$

$$= \mathbb{E}\left(\frac{1}{n}\sum \limits_{i=1}^{n} (\xi_i-\mu)^2\right) + 2\,\mathbb{E}\left(\frac{1}{n}\sum \limits_{i=1}^{n}(\xi_i - \mu)(\mu - \hat{\mu})\right) + \mathbb{E}\left(\frac{1}{n}\sum \limits_{i=1}^{n}(\hat{\mu} - \mu)^2\right) =$$

$$= \frac{n}{n}\sigma^2 - \mathbb{E}\left(\frac{2(\hat{\mu} - \mu)}{n} \sum \limits_{i=1}^{n}(\xi_i-\mu)\right) + \frac{n \frac{\sigma^2}{n}}{n} = \sigma^2 - \mathbb{E}\left(\frac{2(\hat{\mu}-\mu)}{n} \cdot n(\hat{\mu}-\mu)\right) + \frac{\sigma^2}{n} = \sigma^2 - 2\frac{\sigma^2}{n} + \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2.$$

Thus, the unbiased sample variance is $S^2 = \frac{n}{n-1}\hat{S}^2$, so that $\mathbb{E}(S^2) = \mathbb{E}(\frac{n}{n-1}\hat{S}^2) = \sigma^2$.
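This is easy to check numerically (a sketch with hypothetical parameters; the `ddof` argument of `numpy.var` switches between the naive and the Bessel-corrected estimator):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n = 4.0, 5  # hypothetical true variance and (small) sample size
samples = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))

naive = samples.var(axis=1, ddof=0).mean()     # estimates (n-1)/n * sigma^2
unbiased = samples.var(axis=1, ddof=1).mean()  # estimates sigma^2

print(naive, (n - 1) / n * sigma2)  # both close to 3.2
print(unbiased, sigma2)             # both close to 4.0
```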

Sample variance is a sum of squares of non-independent normal random variables

Now, what we are aiming to do is to construct a ratio between the squared sample mean and the sample variance that would follow the Fisher-Snedecor F-distribution, which is the distribution of a ratio of two chi-squared-distributed random variables (each divided by its number of degrees of freedom).

We want the sample variance divided by the exact variance, $\frac{S^2}{\sigma^2} = \frac{1}{(n-1) \sigma^2} \sum \limits_{i=1}^{n} (\xi_i - \bar{\xi})^2$, to be a $\chi^2$-distributed random variable.

It is tempting to assume that it is a sum of squares of $n$ standard normal variables and that, thus, $\frac{S^2}{\sigma^2}$ would be $\chi_n^2$-distributed, with $n$ degrees of freedom.

Indeed:

- $\xi_i \sim \mathcal{N}(\mu, \sigma^2)$
- $\bar{\xi} = \frac{\sum \limits_{i=1}^n \xi_i}{n} \sim \mathcal{N}(\mu, \frac{\cancel{n}\sigma^2}{n^{\cancel{2}}}=\frac{\sigma^2}{n})$ (variance of a sum is the sum of variances; variance of a variable divided by $n$ is its variance divided by $n^2$)
- $\xi_i - \bar{\xi} = \frac{n-1}{n}\xi_i - \underbrace{\sum \limits_{j \neq i} \frac{\xi_j}{n}}_{n-1 \text{ items}} \sim \mathcal{N}(0, \frac{(n-1)^2}{n^2}\sigma^2 + \frac{(n-1)\sigma^2}{n^2} = \frac{(n-1)(n-1+1)}{n^2} \sigma^2 = \frac{n-1}{n}\sigma^2)$ (variance of a difference of independent variables is the sum of their variances)
- $\frac{\xi_i - \bar{\xi}}{\sqrt{n-1}\,\sigma} \sim \mathcal{N}(0, \frac{1}{n})$

However, there is a huge problem: the summands are normal variables, but NOT independent normal variables! E.g. if $n=2$, they are exactly the opposites of each other, and the number of degrees of freedom equals 1. If $n=3$, two of them can take arbitrary values, but the third one is then fully determined. This sounds very much like the argument in Pearson’s goodness of fit test, right? Let us prove this one, too.

Sample variance is distributed as a chi-squared random variable with n-1 degrees of freedom

I am following the logic of this well-written article from PennState.

Ok, suppose that we knew the exact expectation $\mu$. Then let us construct the sum of squares of our samples:

$$\sum \limits_{i=1}^n \frac{(\xi_i - \mu)^2}{\sigma^2} \sim \chi^2_n$$

Again, let us add and subtract the sample mean inside this sum of squares:

$$\sum \limits_{i=1}^n \frac{(\xi_i - \mu)^2}{\sigma^2} = \sum \limits_{i=1}^n \frac{(\xi_i - \bar{\xi} + \bar{\xi} - \mu)^2}{\sigma^2} = \sum \limits_{i=1}^n \left(\frac{(\xi_i-\bar{\xi})^2}{\sigma^2} + \underbrace{2 \frac{(\xi_i - \bar{\xi})(\bar{\xi} - \mu)}{\sigma^2}}_{0 \text{ due to }\sum \limits_{i=1}^n (\xi_i - \bar{\xi}) = 0} + \frac{(\bar{\xi} - \mu)^2}{\sigma^2}\right) = (n-1)\frac{S^2}{\sigma^2} + n\frac{(\bar{\xi} - \mu)^2}{\sigma^2}$$

or:

$\sum \limits_{i=1}^n \frac{(\xi_i - \mu)^2}{\sigma^2} = (n-1)\frac{S^2}{\sigma^2} + n\frac{(\bar{\xi} - \mu)^2}{\sigma^2}$, where $\sum \limits_{i=1}^n \frac{(\xi_i - \mu)^2}{\sigma^2} \sim \chi^2_n$ and $n\frac{(\bar{\xi} - \mu)^2}{\sigma^2} \sim \chi^2_1$.
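The decomposition is an algebraic identity, so it can be verified on any sample (a sketch with hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 5.0, 2.0, 10  # hypothetical parameters
xi = rng.normal(mu, sigma, size=n)

xbar, S2 = xi.mean(), xi.var(ddof=1)
lhs = np.sum((xi - mu) ** 2) / sigma**2
rhs = (n - 1) * S2 / sigma**2 + n * (xbar - mu) ** 2 / sigma**2
print(lhs, rhs)  # identical up to floating-point error
```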

By Cochran’s theorem, the sample variance $S^2$ is independent of the sample mean $\bar{\xi}$; thus, the probability density function of $\sum \limits_{i=1}^n \frac{(\xi_i - \mu)^2}{\sigma^2}$ is a convolution of the probability density functions of $(n-1)\frac{S^2}{\sigma^2}$ and $n\frac{(\bar{\xi} - \mu)^2}{\sigma^2}$.

Now, to derive the distribution of $S^2$, we can either use the convolution formula directly or apply one of the spectral analysis tools: moment-generating functions/cumulants or characteristic functions/the Fourier transform.

The Fourier transform of a convolution is the product of the Fourier transforms. Thus, $\phi_{\sum \limits_{i=1}^n \frac{(\xi_i - \mu)^2}{\sigma^2}}(t) = \phi_{(n-1)\frac{S^2}{\sigma^2}}(t) \cdot \phi_{n\frac{(\bar{\xi} - \mu)^2}{\sigma^2}}(t)$.

The characteristic function of a chi-squared distribution is $\phi_{\chi^2_n}(t) = (1-2it)^{-\frac{n}{2}}$.

Thus, the characteristic function $\phi_{(n-1)\frac{S^2}{\sigma^2}}(t) = \frac{\phi_{\chi^2_n}(t)}{\phi_{\chi^2_1}(t)} = (1-2it)^{-\frac{n}{2}} \cdot (1-2it)^{\frac{1}{2}} = (1-2it)^{-\frac{n-1}{2}}$. But this is the characteristic function of $\chi^2_{n-1}$ (a characteristic function uniquely determines the distribution, so coincidence of characteristic functions implies coincidence of distributions).

Hence, $(n-1)\frac{S^2}{\sigma^2} \sim \chi^2_{n-1}$.
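Again, a quick simulation sketch (hypothetical parameters) confirms that the quantiles of $(n-1)\frac{S^2}{\sigma^2}$ match those of $\chi^2_{n-1}$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu, sigma, n = 5.0, 2.0, 10  # hypothetical parameters
samples = rng.normal(mu, sigma, size=(100_000, n))

stat = (n - 1) * samples.var(axis=1, ddof=1) / sigma**2

# empirical quantiles vs chi-squared with n-1 degrees of freedom
for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(stat, q), stats.chi2.ppf(q, df=n - 1))
```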

Cochran’s theorem: Independence of sample mean and sample variance

However, it is not obvious that our numerator (the sample mean) and denominator (the sample variance) are independent. To deal with this problem, we need one more tool in our pocket.

A general argument, called Cochran’s theorem, can be used to prove the independence of these two.

I will consider Cochran’s theorem in detail in the next post.

t-statistic distribution derivation from F-distribution

Let us derive the t-Student distribution from the Fisher-Snedecor F-distribution.

We know that $F_{T^2}(x) = p(T^2 \leq x) = p(-\sqrt{x} \leq T \leq \sqrt{x}) = F_T(\sqrt{x}) - F_T(-\sqrt{x})$.

Hence, differentiating, we get: $f_{T^2}(x) = \frac{\partial F_T(\sqrt{x})}{\partial \sqrt{x}} \frac{\partial \sqrt{x}}{\partial x} - \frac{\partial F_T(-\sqrt{x})}{\partial (-\sqrt{x})} \frac{\partial (-\sqrt{x})}{\partial x} = \frac{f_T(\sqrt{x})}{2\sqrt{x}} + \frac{f_T(-\sqrt{x})}{2\sqrt{x}}$.

The probability density function of $T$ is symmetric, since the underlying distributions are symmetric.

Thus, $f_{T^2}(x) = \frac{f_T(\sqrt{x})}{2\sqrt{x}} + \frac{f_T(-\sqrt{x})}{2\sqrt{x}} = \frac{f_T(\sqrt{x})}{\sqrt{x}}$, or $f_T(x) = |x| f_{T^2}(x^2)$.

Substituting the probability density function of the F-distribution with 1 and $n-1$ degrees of freedom into the last formula, we obtain the probability density function of the t-Student distribution with $n-1$ degrees of freedom:

$$f_T(x) = |x| \cdot \frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2}) \Gamma(\frac{1}{2})} \frac{ (\frac{1}{n-1})^{1/2} (x^2)^{1/2-1} }{ (1+\frac{x^2}{n-1})^{n/2} } = \frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2}) \sqrt{(n-1)\pi} (1 + \frac{x^2}{n-1})^{n/2}}$$
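As a sanity check, this closed-form density agrees with scipy’s implementation of the t-distribution (a sketch; `t_pdf` is a helper name introduced here, not a library function):

```python
import numpy as np
from scipy import stats
from scipy.special import gamma

def t_pdf(x, n):
    # the density derived above, with n-1 degrees of freedom
    k = n - 1
    return gamma(n / 2) / (gamma(k / 2) * np.sqrt(k * np.pi)) * (1 + x**2 / k) ** (-n / 2)

x = np.linspace(-4, 4, 9)
n = 10
print(np.allclose(t_pdf(x, n), stats.t.pdf(x, df=n - 1)))  # True
```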

In this section I was following the logic of this post.

Student’s t-test

Student’s t-test is a family of statistical tests based on the t-distribution.

Paired data and unpaired data

Suppose that you have pairs of data from two measurements, e.g. the same person’s temperature without treatment and with treatment.

Our null hypothesis is that the treatment doesn’t work. Then the random variable

$$T = \frac{\bar{d}}{S_d / \sqrt{n}}$$

(where $d_i$ is the within-pair difference between the two measurements, $\bar{d}$ is the mean difference, equal to $mean_1 - mean_2$, and $S_d$ is the sample standard deviation of the differences)

is supposed to be t-Student distributed with $n-1$ degrees of freedom under the null hypothesis. We will reject the null hypothesis if the p-value for the obtained value of $T$ is sufficiently low.
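A minimal worked example (the temperature numbers below are made up for illustration; `scipy.stats.ttest_rel` implements exactly this paired test):

```python
import numpy as np
from scipy import stats

# hypothetical paired measurements: temperature without and with treatment
without = np.array([38.2, 38.9, 37.8, 38.5, 39.1, 38.0, 38.7, 38.3])
with_tr = np.array([37.1, 37.8, 37.5, 37.2, 38.0, 37.4, 37.9, 37.3])

d = without - with_tr
n = len(d)
T = d.mean() / (d.std(ddof=1) / np.sqrt(n))
p = 2 * stats.t.sf(abs(T), df=n - 1)  # two-sided p-value

print(T, p)
print(stats.ttest_rel(without, with_tr))  # same statistic and p-value
```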

TODO: unpaired data

Equal and unequal variance

TODO

Confidence intervals estimation

The t-Student distribution can also be helpful for estimating confidence intervals (see wikipedia 1 and 2) for the estimate of the mean.

Suppose that we need to calculate the range of reasonably probable values of the mean $\mu$ of our normal distribution.

Pick a value $A$ that corresponds to a probability level of 90% or 95% of the t-Student distribution, e.g. $p(-A < T < A) = 0.9$. Using this level $A$ we can calculate the confidence interval for $\mu$:

$p(-A < T < A) = p(-A < \frac{\bar{\xi} - \mu}{\frac{S}{\sqrt{n}}} < A)$; thus, $p(\bar{\xi} - A\frac{S}{\sqrt{n}} < \mu < \bar{\xi} + A\frac{S}{\sqrt{n}}) = 0.9$.

So, our confidence interval for $\mu$ is $\mu \in [\bar{\xi} - A\frac{S}{\sqrt{n}},\ \bar{\xi} + A\frac{S}{\sqrt{n}}]$.
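In code (a sketch with a hypothetical sample; `scipy.stats.t.interval` packages the same computation):

```python
import numpy as np
from scipy import stats

xi = np.array([4.8, 5.1, 5.4, 4.9, 5.2, 5.0, 4.7, 5.3])  # hypothetical sample
n = len(xi)
xbar, s = xi.mean(), xi.std(ddof=1)

A = stats.t.ppf(0.95, df=n - 1)  # so that p(-A < T < A) = 0.9
print(xbar - A * s / np.sqrt(n), xbar + A * s / np.sqrt(n))

# cross-check with scipy's built-in interval
print(stats.t.interval(0.9, df=n - 1, loc=xbar, scale=s / np.sqrt(n)))
```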


Boris Burkov

Written by Boris Burkov, who lives in Moscow, Russia, loves to take part in the development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me on Telegram.