Intro to the Extreme Value Theory and Extreme Value Distribution

April 30, 2023 176 min read

Quite often in mathematical statistics I run into Extreme Value Distribution - an analogue of Central Limit Theorem, which describes the distribution of maximum/minimum, observed in a series of i.i.d random variable tosses. This is an introductory text with the basic concepts and proofs of results from extreme value theory, such as Generalized Extreme Value and Pareto distributions, Fisher-Tippett-Gnedenko theorem, von Mises conditions, Pickands-Balkema-de Haan theorem and their applications.


  1. Problem statement and Generalized Extreme Value distribution
    • Type I: Gumbel distribution
    • Type II: Frechet distribution
    • Type III: Inverse Weibull distribution
  2. Fisher-Tippett-Gnedenko theorem
    • General approach: max-stable distributions as invariants/fixed points/attractors and EVD types as equivalence classes
    • Khinchin’s theorem (Law of Convergence of Types)
    • Necessary conditions of maximium stability
    • Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)
    • Distributions not in domains of attraction of any maximum-stable distributions
  3. Von Mises sufficient conditions for a distribution to belong to a type I, II or III
    • Pre-requisites from survival analysis
    • Von Mises conditions proof
    • Generalizations of von Mises condition for Type I EVD: auxiliary function and von Mises function
  4. Necessary and sufficient conditions for a distribution to belong to a type I, II or III
    • Pre-requisites from Karamata’s theory of slow/regular/Г-/П- variation
    • Necessary and sufficient conditions of convergence to Types II or III EVD
    • Necessary and sufficient conditions of convergence to Type I EVD
  5. Residual life time
    • Generalized Pareto distribution
    • Residual life time problem
    • Pickands-Balkema-de Haan theorem (a.k.a. Second Extreme Value Theorem)
  6. Order statistics and parameter estimation
    • Order statistics
    • Hill’s estimator
    • Pickands’ estimator
    • Other estimators
  7. Summary and examples of practical application
    • Examples of Type I Gumbel distribution
    • Examples of Type II Frechet distribution
    • Examples of Type III Inverse Weibull distribution
  8. Concluding remarks

1. Problem statement and Generalized Extreme Value distribution

One of the most famous results in probabilities is Central Limit Theorem, which claims that sum of nn \to \infty i.i.d. random variables ξi\xi_i after centering and normalizing converges to Gaussian distribution.

Now, what if we ask a similar question about maximum of those nn \to \infty i.i.d. random variables instead of sum? Does it converge to any distribution?

Turns out that it depends on the properties of the distribution ξi\xi_i, but not much really. Regardless of the distribution of ξi\xi_i the distribution of maximum of nn random variables ξi\xi_i is:

Gγ(x)=exp((1+γx)1γ)G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}})

This distribution is called Generalized Extreme Value Distribution. Depending on the coefficient γ\gamma it can take one of three specific forms:

Type I: Gumbel distribution

If γ0\gamma \to 0, we can assume that k=1γk = \frac{1}{\gamma} \to \infty. Then generalized EVD converges to a doubly-exponential distribution (sometimes this is called a law of double logarithm) by definition of e=(1+1k)ke = (1 + \frac{1}{k})^k and ex=(1+1kx)ke^x = (1 + \frac{1}{k}x)^k:

Gγ(x)=exp((1+γx)1γ)=exp((1+1kx)k)=exp(ex)G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}}) = exp(-(1 + \frac{1}{k} x)^{-k}) = exp(-e^{-x}).

This is Gumbel distribution, it oftentimes occurs in various areas, e.g. bioinformatics, describing the distribution of longest series of successes in coin tosses in nn experiments of tossing a coin 100 times.

It is often parametrized by scale and center parameters. I will keep it centered here, but will add shape parameter λ\lambda:

F(x)=eexλF(x) = e^{-e^{-\frac{x}{\lambda}}}, or, in a more intuitive notation F(x)=1eex/λF(x) = \frac{1}{\sqrt[e^{x/\lambda}]{e}}.

It is straightforward to derive probability density function f(x)f(x) from here:

f(x)=Fx=exλ(1λ)eexλ=1λexλ+exλf(x) = \frac{\partial F}{\partial x} = -e^{-\frac{x}{\lambda}} \cdot (-\frac{1}{\lambda}) \cdot e^{-e^{-\frac{x}{\lambda}}} = \frac{1}{\lambda} e^{- \frac{x}{\lambda} + e^{-\frac{x}{\lambda}}}.

import math

import numpy as np
import matplotlib.pyplot as plt

scale = 1

# Generate x values from 0.1 to 20 with a step size of 0.1
x = np.arange(-20, 20, 0.1)

# Calculate y values
gumbel_cdf = math.e**(-math.e**(-(x/scale)))
gumbel_pdf = (1 / scale) * np.exp(-( x/scale + math.e**(-(x / scale))))

# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)

# Plot cdf
ax.plot(x, gumbel_cdf, label='cdf')

# Plot pdf
ax.plot(x, gumbel_pdf, label='pdf')

# Set up the legend

# Set up the labels and title
ax.set_title('Plot of Gumbel pdf and cdf')

# Display the plot

Gumbel distribution plot

Gumbel distribution plot, scale = 1.

Type II: Frechet distribution

If γ>0\gamma > 0, let us denote k=1γk = \frac{1}{\gamma} (k > 0), y=λ(1+γx)y = \lambda \cdot (1 + \gamma x), where kk is called shape parameter and λ\lambda - scale parameter. Then distribution takes the shape:

Gγ(x)=exp((1+γx)1γ)=exp((yλ)k)G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}}) = exp(-(\frac{y}{\lambda})^{-k}).

To make it more intuitive, I’ll re-write cdf in the following way: F(x)=1e(λx)kF(x) = \frac{1}{e^{(\frac{\lambda}{x})^k}}.

This is Frechet distribution. It arises when the tails of the original cumulative distribution function Fξ(x)F_{\xi}(x) are heavy, e.g. when it is Pareto distribution.

Let us derive the probability density function for it:

f(x)=Fx=(xλ)k1(k)1λe(xλ)k=kλ(xλ)k1e(xλ)kf(x) = \frac{\partial F}{\partial x} = - (\frac{x}{\lambda})^{-k - 1} \cdot (-k) \cdot \frac{1}{\lambda} \cdot e^{-(\frac{x}{\lambda})^{-k}} = \frac{k}{\lambda} \cdot (\frac{x}{\lambda})^{-k-1} \cdot e^{-(\frac{x}{\lambda})^{-k} }.

Here is the plot:

import math

import numpy as np
import matplotlib.pyplot as plt

shape = 2  # alpha
scale = 2  # beta

# Generate x values from 0.1 to 20 with a step size of 0.1
x = np.arange(0, 20, 0.1)

# Calculate y values
frechet_cdf = math.e**(-(scale / x) ** shape)
frechet_pdf = (shape / scale) * ((scale / x) ** (shape + 1)) * np.exp(-((scale / x) ** shape))

# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)

# Plot cdf
ax.plot(x, frechet_cdf, label='cdf')

# Plot pdf
ax.plot(x, frechet_pdf, label='pdf')

# Set up the legend

# Set up the labels and title
ax.set_title('Plot of Frechet distribution pdf and cdf')

# Display the plot

Frechet distribution plot

Frechet distribution plot, scale = 1, shape = 1.

Type III: Inverse Weibull distribution

If γ<0\gamma < 0, let us denote k=1γk = -\frac{1}{\gamma} (k > 0, different kinds of behaviour are observed at 0<k<10 < k < 1, k=1k = 1 and k>1k > 1), y=λ(1+γx)y = \lambda (1 + \gamma x).

Then distribution takes the shape:

Gγ(x)=exp((1+γx)1γ)=exp((yλ)k)G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}}) = exp(-(\frac{y}{\lambda})^k).

Gγ(x)={exp((xλ)k),x01,x>0G_{\gamma}(x) = \begin{cases} exp(-(\frac{x}{\lambda})^{k}), x \le 0 \\ 1, x > 0 \end{cases}.

This is Inverse Weibull distribution. Its direct counterpart (Weibull distribution) often occurs in survival analysis as a hazard rate function. It also arises in mining - there it describes the mass distribution of particles of size xx and is closely connected to Pareto distribution. We shall discuss this connection later.

Generalized extreme value distribution converges to Inverse Weibull, when distribution of our random variable ξ\xi is bounded. E.g. consider uniform distribution ξU(0,1)\xi \sim U(0, 1). It is clear that the maximum of nn uniformly distributed variables will be approaching 1 as nn \to \infty. Turns out that the convergence rate is described by Inverse Weibull distribution.

To make it more intuitive, we can re-write the cdf as F(x)={1e(xλ)k,x01,x>0F(x) = \begin{cases} \frac{1}{e^{ (\frac{-x}{\lambda})^k }}, x \le 0 \\ 1, x > 0 \end{cases}.

Derive from cumulative distribution function F(x)=exp((xλ)k)F(x) = exp(-(\frac{-x}{\lambda})^{k}) the probability density function:

f(x)=Fx=(xλ)k1kλexp((xλ)k)=kλ(xλ)k1exp((xλ)k)f(x) = \frac{\partial F}{\partial x} = -(\frac{-x}{\lambda})^{k-1} \cdot \frac{-k}{\lambda} \cdot exp(-(\frac{-x}{\lambda})^{k}) = \frac{k}{\lambda} \cdot (\frac{-x}{\lambda})^{k-1} \cdot exp(-(\frac{-x}{\lambda})^{k}).

Let us draw the plot:

import math

import numpy as np
import matplotlib.pyplot as plt

shape = 2  # alpha
scale = 2  # beta

# Generate x values from 0.1 to 20 with a step size of 0.1
x = np.arange(-20, 0, 0.1)

# Calculate y values
inverse_weibull_cdf = math.e**(-(-x/scale) ** shape)
inverse_weibull_pdf = (shape / scale) * ((-x / scale) ** (shape - 1)) * np.exp(-((-x / scale) ** shape))

# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)

# Plot cdf
ax.plot(x, inverse_weibull_cdf, label='cdf')

# Plot pdf
ax.plot(x, inverse_weibull_pdf, label='pdf')

# Set up the legend

# Set up the labels and title
ax.set_title('Plot of Inverse Weibull pdf and cdf')

# Display the plot

Inverse Weibull plot

Plot of Inverse Weibull distribution, shape = 2, scale = 2.

2. Fisher-Tippett-Gnedenko theorem

Extreme Value Theorem is a series of theorems, proven in the first half of 20-th century. They claim that maximum of several tosses of i.i.d. random variables converges to just one of 3 possible distributions, Gumbel, Frechet or Weibull.

Here I will lay out the outline of the proof with my comments. The proof includes introduction of several technical tools, but I will comment on their function and rationale behind each of them.

Consider a random variable MnM_n, which describes the distribution of maximum of ξi\xi_i, \in 1..n

p(Mnx)=i=1np(ξix)=Fn(ξix)p(M_n \le x) = \prod \limits_{i=1}^{n} p(\xi_i \le x) = F^n(\xi_i \le x).

Similarly to the Central Limit Theorem, a convergence theorem might be applicable to the distribution of a normalized random variable MnM_n rather than the non-normalized:

p(Mnbnanx)=p(Mnanx+bn)=Fn(anx+bn)p(\frac{M_n - b_n}{a_n} \le x) = p(M_n \le a_n x + b_n) = F^n(a_n x + b_n)

We aim to show that for some series of constants aia_i and bib_i

Fn(anx+bn)F^n(a_n x + b_n) as nn \to \infty converges in distribution to some distribution G(x)G(x): Fn(anx+bn)nwG(x)F^n(a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x).

Now I will informally describe the proof outline, before introducing the mathematical formalism.

General approach: max-stable distributions as invariants/fixed points/attractors and EVD types as equivalence classes

I assume that all three types of Extreme Value Distribution were first discovered experimentally. Later statisticians came up with a proof that EVD can converge to just one of three possible types of distributions and no other types of EVD can exist. Finally, they came up with criteria for a distribution to belong to each type.

Design of this proof is similar to many other proofs. I will outline it informally here:

Assume that as the number of random variables nn \to \infty increases, approaching infinity, the distribution of the observed maximum approaches some type of distribution. Then such a distribution type can be considered as an invariant or attractor or fixed point, similar to many other mathematical problems. For instance, eigenvectors are fixed points of matrix multiplication. E.g. matrix eigenvector, multiplied by a matrix, results in itself, multiplied by a scalar. Or no matter how many times you take a derivative of ekxe^{kx}, you get ekxe^{kx}, multiplied by a scalar kk.

Similarly, maximum-stable distributions are invariant objects. Those are distributions, maximum of i.i.d. variables of which converges to themselves, no matter how many more i.i.d. random variables you toss. E.g. if for one Gumbel-distributed random variable ξ\xi we know that pξ(M1b1a1x)=eexp_{\xi}(\frac{M_1 - b_1}{a_1} \le x) = e^{-e^{-x}}, for nn \to \infty Gumbel-distributed random variables the maximum of ξ1..ξn\xi_1.. \xi_n still is Gumbel-distributed (after centering and normalizing them by some numbers ana_n, bnb_n): pMn(Mnbnanx)=eexp_{M_{n}}(\frac{M_n - b_n}{a_n} \le x) = e^{-e^{-x}}.

Ok. Then after we established that there are some distributions, for which maximum of nn \to \infty centered and normalized i.i.d. variables produces a random variable with the same distribution, how do we show that all distributions converge to one of them?

We’ll use another classical mathematical tool: equivalence classes and equivalence relation. For instance, odd numbers and even numbers form two equivalence classes under operation of modulo 2. Odd numbers are equivalent to each other in terms of producing remainder 1 (e.g. 353 \sim 5, where \sim is equivalence relation of modulo 2), and even numbers are equivalent in terms of producing remainder 0.

Similarly, we will show that types of EVD form equivalence classes under the operation of finding maximum of nn \to \infty i.i.d. random variables with any distribution, and as a result all the distributions converge to one of those types. E.g. Pareto’s distribution is equivalent to Cauchy distribution under equivalence relation of convergence of maximum of nn \to \infty Pareto/Cauchy i.i.d’s to the same maximum stable type II (Frechet) EVD.

Now that I’ve laid out the plan of the proof, it is time to get into technicalities. I will formally introduce the concepts I mentioned above and prove some lemmas about their relatedness.

Definition 2.1: Max-stable cumulative distribution function

GG is max-stable if for all n1..Nn \in 1..N and for all x there exists {an},{bn}R+\{a_n\}, \{b_n\} \subset \mathbb{R}^+ such that for all xRx \in \mathbb{R} G(x)=Gn(anx+bn)G(x) = G_n(a_n x + b_n).

Definition 2.2: Domain of attraction

If FF is a cdf, then FF is in the domain of attraction (for maxima) of GG, and it is written FD(G)F \in \mathcal{D}(G), when there exist sequences {an},{bn}R+\{a_n\}, \{b_n\} \subset \mathbb{R}^+ such that Fn(anx+bn)nwG(x)F^n (a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x).

Definition 2.3: Type of convergence

If G(x)G^*(x) is another non-degenerate cdf, we say that GG and GG^* have the same type if for all xx there exist a>0a > 0 and bRb \in R such that for every x ∈ R G(ax+b)=G(x)G^*(ax + b) = G(x).

Khinchin’s theorem (Law of Convergence of Types)


Aleksandr Yakovlevich Khinchin

Lemma 2.1: Khinchin’s theorem (law of Convergence of Types)

Suppose that we have a sequence of distribution functions {Fn}\{F_n\} (e.g. the distributions of maximum of random variable ξi\xi_i in nn experiments).

Let those distribution functions upon nn \to \infty converge to a certain distribution G(x)G(x): Fn(anx+bn)nwG(x)F_n(a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x). Then we have two series of constants {an},{bn}\{a_n\}, \{b_n\}.

Suppose there is another distribution function H(x)H(x) such that the sequence of distributions Fn(αnx+βn)F_n(\alpha_n x + \beta_n) converges to that function: Fn(αnx+βn)nwH(x)F_n(\alpha_n x + \beta_n) \xrightarrow[n \to \infty]{w} H(x) and there is a different pair of series {αn},{βn}\{ \alpha_n \}, \{\beta_n \}.

Then H(x)=G(Ax+B)H(x) = G(Ax + B) and A=αnanA = \frac{\alpha_n}{a_n}, B=βnbnanB = \frac{\beta_n - b_n}{a_n}.


Consider two distribution functions G(x)G(x) and H(x)H(x), such that for every xx: y=F(ax+b)y = F(ax+b) and y=F(αx+β)y = F(\alpha x + \beta).

Denote y=F(ax+b)G(x)y = F(ax + b) \to G(x). Then F1(y)=ax+bF^{-1}(y) = ax + b and x=F1(y)baG1(y)x = \frac{F^{-1}(y) - b}{a} \to G^{-1}(y).

Similarly y=F(αx+β)H(x)y = F(\alpha x + \beta) \to H(x) and F1(y)=αx+βF^{-1}(y) = \alpha x + \beta and x=F1(y)βαH1(y)x = \frac{F^{-1}(y) - \beta}{\alpha} \to H^{-1}(y).

Now choose two points: x1x_1, corresponding to y1y_1, and x2x_2, corresponding to y2y_2 and subtract x1x_1 and x2x_2 from each other:

x1x2=F1(y1)F1(y2)aG1(y1)G1(y2)x_1 - x_2 = \frac{F^{-1}(y_1) - F^{-1}(y_2)}{a} \to G^{-1}(y_1) - G^{-1}(y_2)

Apply the same for H1H^{-1}:

x1x2=F1(y1)F1(y2)αH1(y1)H1(y2)x_1 - x_2 = \frac{F^{-1}(y_1) - F^{-1}(y_2)}{\alpha} \to H^{-1}(y_1) - H^{-1}(y_2)

Which results in G1(y1)G1(y2)H1(y1)H1(y2)αa=A\frac{G^{-1}(y_1) - G^{-1}(y_2)}{H^{-1}(y_1) - H^{-1}(y_2)} \to \frac{\alpha}{a} = A.

Substitute α=Aa\alpha = A \cdot a into H1(y)x=F1(y)βAaH^{-1}(y) \to x = \frac{F^{-1}(y) - \beta}{A \cdot a} and AH1(y)Ax=F1(y)βaA \cdot H^{-1}(y) \to A \cdot x = \frac{F^{-1}(y) - \beta}{a}.

On the other hand we recall that G1(y)x=F1(y)baG^{-1}(y) \to x = \frac{F^{-1}(y) - b}{a}. Subtracting these, we get: AH1(y)G1(y)F1(y)βaF1(y)ba=bβaA \cdot H^{-1}(y) - G^{-1}(y) \to \frac{F^{-1}(y) - \beta}{a} - \frac{F^{-1}(y) - b}{a} = \frac{b - \beta}{a} or βba=BG1(y)AH1(y)\frac{\beta - b}{a} = B \to G^{-1}(y) - A \cdot H^{-1}(y).

Hence, G1(y)AH1(y)+BG^{-1}(y) \to A \cdot H^{-1}(y) + B.

Lemma 2.2: Necessary condition of maximum-stability

Given G a non-degenerate cdf:

  1. G is max-stable if and only if there exists a sequence {Fn}\{F_n\} of cdf ’s and sequences

{an}R+\{a_n\} \subset \mathbb{R}^+, {bn}\{b_n\} such that for all kNk \in N Fn(ankx+bnk)nwG1/k(x)F_n(a_{nk} x + b_{nk}) \xrightarrow[n \to \infty]{w} G^{1/k}(x)

  1. D(G)0\mathcal{D}(G) \neq 0 if and only if GG is max-stable. In that case, GD(G)G \in \mathcal{D}(G).


Proposition 1 direct statement: if GG is max-stable, there exists {Fn}\{F_n\} such that …

If GG is max-stable, then by definition for every nNn \in \mathbb{N} there exist ana_n, bnb_n, such that Gn(anx+bn)=G(x)G^{n}(a_n x + b_n) = G(x).

Define Fn=GnF_n = G^n. Then Fnk(ankx+bnk)=Gnk(ankx+bnk)=GF^k_n(a_{nk} x + b_{nk}) = G^{nk}(a_{nk} x + b_{nk}) = G. We arrive at the direct statement.

Proposition 1 reverse statement: if GG is max-stable, there exists {Fn}\{F_n\} such that …

Let us proof the reverse statement: suppose that the sequences {Fn}\{F^n\}, {an}\{a_n\}, {bn}\{b_n\} exist, such that for all kNk \in \mathbb{N}:

Fn(ankx+bnk)nwG1/k(x)F_n(a_{nk}x + b_{nk}) \xrightarrow[n \to \infty]{w} G^{1/k}(x)

Then consider k=1k=1 and k=2k=2:

Fn(anx+bn)nwG(x)F_n(a_{n}x + b_{n}) \xrightarrow[n \to \infty]{w} G(x) and Fn(a2nx+b2n)nwG1/2(x)F_n(a_{2n}x + b_{2n}) \xrightarrow[n \to \infty]{w} G^{1/2}(x)

By Khinchin’s lemma there exists G(α2x+β2)=G1/2(x)G(\alpha_2 x+ \beta_2) = G^{1/2}(x).

Similarly, for every other kk: G(αkx+βk)=G1/k(x)G(\alpha_k x + \beta_k) = G^{1/k}(x) or Gk(αkx+βk)=G(x)G^k(\alpha_k x + \beta_k) = G(x), which is the definition of max-stability.

Proposition 2 direct statement:

The proof is self-evident: if G is max-stable, Gn(anx+bn)=G(x)G^n(a_n x + b_n) = G(x), and GD(G)G \in \mathcal{D}(G) by defintion.

Proposition 2 reverse statement:

Assume FD(G)F \in \mathcal{D}(G), i.e. Fn(anx+bn)nwG(x)F^n (a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x).

For all kNk \in \mathbb{N} we have Fnk(ankx+bnk)nwG(x)F^{nk} (a_{nk} x + b_{nk}) \xrightarrow[n \to \infty]{w} G(x).

Hence, Fn(ankx+bnk)nwG1/k(x)F^{n} (a_{nk} x + b_{nk}) \xrightarrow[n \to \infty]{w} G^{1/k}(x)

This makes GG and GkG^k fit for the conditions of previous result, proving that GG is max-stable.

Corollary 2.1:

Let GG be a max-stable cdf. Then there exist functions a(s)>0a(s) > 0 and b(s)b(s) such that for all xRx \in \mathbb{R}, for all s>0s > 0, Gs(a(s)x+b(s))=G(x)G^s(a(s)x + b(s)) = G(x).

Corollary is self-evident from inversion of indices s=1ks = \frac{1}{k}.

Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)

Sir Ronald Aylmer Fisher Leonard Henry Caleb Tippett Boris Vladimirovich Gnedenko
Fisher Tippett Gnedenko

Theorem 2.1: Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)

Let ξi\xi_i be a sequence of i.i.d. random variables.

If there exist constants an>0a_n > 0, bnRb_n \in \mathbb{R} and some non-degenerate cumulative distribution function GG such that MnbnanG\frac{M_n - b_n}{a_n} \sim G, then GG is one of these:

(Type I) Gumbel: G(x)=exp(ex)G(x) = exp(-e^{-x}), xRx \in \mathbb{R},

(Type II) Frechet: G(x)=exp(xα)G(x) = exp(-x^{-\alpha}), x0,α>0x \ge 0, \alpha > 0,

(Type III) Inverse Weibull: G(x)=exp((x)α)G(x) = exp(-(-x)^{\alpha}), x0,α>0x \le 0, \alpha > 0.


Here we give the proof of Fisher-Tippett-Gnedenko theorem without introducing any additional pre-requisites and intermediate constructs. Because of that it might look like black magic now. It is not clear, how anyone could’ve come up with this proof.

However, later on in parts 3 and 4 we will give the definitions of tail quantile function and tools from Karamata’s theory of slow/regular variation.

If you revisit this proof afterwards, you will notice that we’re making use of those tools, without naming them explicitly.

Step 1.

Consider double negative logarithm of max-stable distribution G(a(s)x+b(s))s=G(x)G(a(s)x + b(s))^s = G(x).

ln(ln(G(a(s)x+b(s))s))=ln(sln(G(a(s)x+b(s))))=ln(ln(G(a(s)x+b(s))))lns=ln(lnG(x))-\ln(-\ln(G(a(s)x + b(s))^{s})) = -\ln( -s \cdot \ln(G(a(s)x + b(s)))) = -\ln(-\ln(G(a(s)x + b(s)))) - \ln s = -\ln(-\ln G(x))

Step 2.

Denote ϕ(x)=ln(ln(G(x)))\phi(x) = -\ln(-\ln(G(x))). Then from previous ϕ(a(s)x+b(s))lns=ϕ(x)\phi(a(s)x + b(s)) - \ln s = \phi(x).

Step 3.

Denote y=ϕ(x)y = \phi(x). Apply ϕ1\phi^{-1} to both sides. We get: ϕ1(ϕ(a(s)x+b(s)))=y+lns\phi^{-1}(\phi(a(s)x + b(s))) = y + \ln s.

a(s)x+b(s)=ϕ1(y+lns)a(s)x + b(s) = \phi^{-1}(y + \ln s)

a(s)ϕ1(y)+b(s)=ϕ1(y+lns)a(s) \phi^{-1}(y) + b(s) = \phi^{-1}(y + \ln s)

ϕ1(y)=ϕ1(y+lns)b(s)a(s)\phi^{-1}(y) = \frac{\phi^{-1}(y + \ln s) - b(s)}{a(s)}

Step 4.

Note that ϕ1(0)=ϕ1(lns)b(s)a(s)\phi^{-1}(0) = \frac{\phi^{-1}(\ln s) - b(s)}{a(s)}. Subtract ϕ1(0)\phi^{-1}(0) from both sides:

ϕ1(y)ϕ1(0)=ϕ1(y+lns)b(s)a(s)ϕ1(lns)b(s)a(s)=ϕ1(y+lns)ϕ1(lns)a(s)\phi^{-1}(y) - \phi^{-1}(0) = \frac{\phi^{-1}(y + \ln s) - b(s)}{a(s)} - \frac{\phi^{-1}(\ln s) - b(s)}{a(s)} = \frac{\phi^{-1}(y + \ln s) - \phi^{-1}(\ln s)}{a(s)}

Step 5.

Substitute variables: ψ1(y)=ϕ1(y)ϕ1(0)\psi^{-1}(y) = \phi^{-1}(y) - \phi^{-1}(0), z=lnsz = \ln s, a~(z)=a(ez)\tilde a(z) = a(e^z). Then:

ψ1(y)=ϕ1(y)ϕ1(0)=ϕ1(y+lns)ϕ1(lns)a(s)=ψ1(y+z)ψ1(z)a~(z)\psi^{-1}(y) = \phi^{-1}(y) - \phi^{-1}(0) = \frac{\phi^{-1}(y + \ln s) - \phi^{-1}(\ln s)}{a(s)} = \frac{\psi^{-1}(y + z) - \psi^{-1}(z)}{\tilde a(z)}

ψ1(y+z)ψ1(z)=ψ1(y)a~(z)\psi^{-1}(y + z) - \psi^{-1}(z) = \psi^{-1}(y) \tilde a(z)

Step 6.

We can swap yy and zz in previous equation, settings y=zy = z and z=yz = y:

ψ1(y+z)ψ1(y)=ψ1(z)a~(y)\psi^{-1}(y + z) - \psi^{-1}(y) = \psi^{-1}(z) \tilde a(y)

After that subtract ψ1(y+z)ψ1(z)=ψ1(y)a~(z)\psi^{-1}(y + z) - \psi^{-1}(z) = \psi^{-1}(y) \tilde a(z) from ψ1(y+z)ψ1(y)=ψ1(z)a~(y)\psi^{-1}(y + z) - \psi^{-1}(y) = \psi^{-1}(z) \tilde a(y):

ψ1(z)ψ1(y)=ψ1(z)a~(y)ψ1(y)a~(z)\psi^{-1}(z) - \psi^{-1}(y) = \psi^{-1}(z) \tilde a(y) - \psi^{-1}(y) \tilde a(z)

ψ1(z)(1a~(y))=ψ1(y)(1a~(z))\psi^{-1}(z) (1 - \tilde a(y)) = \psi^{-1}(y) (1 - \tilde a(z))

Here we consider two cases.

Step 7a.

If a~(z)=1\tilde{a}(z) = 1, previous equation leads us to 0=00 = 0. But then let’s substitute a~(z)=1\tilde{a}(z) = 1 into the result of step 5:

ψ1(y+z)=ψ1(y)+ψ1(z)\psi^{-1}(y + z) = \psi^{-1}(y) + \psi^{-1}(z)

This means that ψ1(y)=ρy\psi^{-1}(y) = \rho y and denoting ν=ϕ1(0)\nu = \phi^{-1}(0), we get:

ρy=ψ1(y)=ϕ1(y)ϕ1(0)=ϕ1(y)ν\rho y = \psi^{-1}(y) = \phi^{-1}(y) - \phi^{-1}(0) = \phi^{-1}(y) - \nu

ϕ1(y)=ν+ρy\phi^{-1}(y) = \nu + \rho y

x=ϕ1(ϕ(x))=ν+ρln(ln(G(x)))x = \phi^{-1}(\phi(x)) = \nu + \rho \ln(-\ln(-G(x)))

G(x)=exp(exνρ)G(x) = exp(-e^{-\frac{x - \nu}{\rho}}), which is Gumbel (Type I) EVD.

Step 7b.

If a~(z)1\tilde{a}(z) \ne 1:

ψ1(y)=ψ1(z)(1a~(z))(1a~(y))=c(1a~(y))\psi^{-1}(y) = \frac{ \psi^{-1}(z) }{ (1 - \tilde a(z)) } (1 - \tilde a(y)) = c (1 - \tilde a(y))

Now recall that ψ1(y+z)ψ1(z)=ψ1(y)a~(z)\psi^{-1}(y + z) - \psi^{-1}(z) = \psi^{-1}(y) \tilde a(z) and substitute ψ1(y)=c(1a~(y))\psi^{-1}(y) = c (1 - \tilde a(y)) there:

c(1a~(y+z))c(1a~(y))=c(1a~(y))a~(z)c (1 - \tilde{a}(y + z)) - c (1 - \tilde{a}(y)) = c (1 - \tilde{a}(y)) \tilde a(z)

This leads us to equation a~(z+y)=a~(y)a~(z)\tilde{a}(z + y) = \tilde{a}(y) \tilde{a}(z), which, upon monotonous a~(y)\tilde{a}(y) has a solution a~(y)=eρy\tilde{a}(y) = e^{\rho y}. Hence:

ψ1(y)=c(1eρy)=ϕ1(y)ϕ1(0)\psi^{-1}(y) = c (1 - e^{\rho y}) = \phi^{-1}(y) - \phi^{-1}(0)

ϕ1(y)=ν+c(1eρy)\phi^{-1}(y) = \nu + c (1 - e^{\rho y}), where ν=ϕ1(0)\nu = \phi^{-1}(0).

Now recall that ϕ(x)=ln(ln(G(x)))\phi(x) = -\ln(-\ln(G(x))), and we get: x=ϕ1(ϕ(x))=ν+c(1eρln(ln(G(x))))x = \phi^{-1}(\phi(x)) = \nu + c (1 - e^{-\rho \ln(-\ln(G(x)))}). Hence:

xνc=1(lnG(x))ρ\frac{x - \nu}{c} = 1 - (-\ln G(x))^{-\rho}

(lnG(x))ρ=1xνc(-\ln G(x))^{-\rho} = 1 - \frac{x - \nu}{c}

lnG(x)=(1xνc)1ρ-\ln G(x) = (1 - \frac{x - \nu}{c})^{-\frac{1}{\rho}}

G(x)=e(1xνc)1ρG(x) = e^{-(1 - \frac{x - \nu}{c})^{-\frac{1}{\rho}}}, which is either a Frechet (Type II), or a Inverse Weibull (Type III) EVD.

Distributions not in domains of attraction of any maximum-stable distributions

We’ve shown that if maximum of n i.i.d. random variables of current distribution converge to any maximum-stable distribution, it is one of the 3 described types. However, maximum might not converge to any max-stable distribution at all.

For instance, Poisson distribution and Geometric distribution do not converge to any type of Extreme Value Distriubtion. To show this we will need much more tools in our toolbox, the corresponding theorem will be proven in the end of section 4.

3. Von Mises sufficient conditions for a distribution to belong to a type I, II or III

Richard von Mises

Richard von Mises

The Fisher-Tippett-Gnedenko theorem is an important theoretical result, but it does not provide an answer to the basic question: what type of EVD does our distribution function FF belong to?

Fortunately, there are two sets of criteria that let us determine the domain of attraction of FF. First, there are von Mises conditions, which are sufficient, but not necessary. Still, they are more intuitive and give a good insight into what kinds of distributions converge to what types of EVD and why. Second, there are general sufficient and necessary conditions. Proving them is a much more technical task and requires some extra preliminaries.

We will start with von Mises conditions, postulated by Richard von Mises in 1936, 7 years before Fisher-Tippett-Gnedenko theorem was proved by Boris Gnedenko in 1943. Von Mises conditions are formulated in terms of survival analysis. We shall introduce some basic notions from survival analysis first.

Pre-requisites from survival analysis

Definition 3.1: Survival function

Survival function S(t)S(t) is reverse of cumulative distribution function F(t)F(t): S(t)=1F(t)S(t) = 1 - F(t).

Basically, if our random variable’s value represents a human longevity, cumulative distribution funcion F(t)=p(ξt)=tf(x)dxF(t) = p(\xi \le t) = \int \limits_{-\infty}^{t} f(x) dx represents the fraction of people, who die by the time tt.

Survival function S(t)=p(ξt)=1p(ξt)=1F(t)S(t) = p(\xi \ge t) = 1 - p(\xi \le t) = 1 - F(t) on the contrary is the fraction of people, who are still alive by the time tt.

Proposition 3.1: integral of survival function equals to average life expectancy

Basically rotate survival function plot by 90 degrees to see that it is expectation of lifetime (just swap x and y axes and it becomes obvious).

Definition 3.2: Survival function end point

We shall denote the end point of survival function xF=sup{x;F(x)<1}x_F = \sup \{ x; F(x) < 1\}. It is also sometimes denoted ω(F)\omega(F).

Basically, xFx_F is the smallest point xx, where survival function S(x)S(x) becomes exactly 0. For instance, if we’re studying the survival of human, and there are known survivors at the age of 128128, but everybody dies by the age of 129 years, xF=129x_F = 129.

If there is no such limit (e.g. the population dies out exponentially S(x)=exS(x) = e^{-x} or polynomially S(x)=1xS(x) = \frac{1}{x}), we say that xF=x_F = \infty.

Definition 3.3: Tail quantile function

Tail quantile function of nn is the smallest time tt, when the fraction of survivors becomes smaller than nn:

γ(n)=inf{t;F(t)11n}=inf{t;S(t)1n}\gamma(n) = \inf \{ t; F(t) \le 1 - \frac{1}{n} \} = \inf \{ t; S(t) \ge \frac{1}{n} \}

For instance, tail quantile function of 10 is the time, when 1/10 of population is still alive.

Lemma 3.1: convergence of tail quantile function to exponent

Consider a sequence {xn}\{ x_n \} of data points, such that each xntnx_n \to t_n as nn \to \infty, where {tn}\{t_n\} are the values of tail quantile function at τn\frac{\tau}{n}:

γ(τn)=inf{tn;S(tn)τn}\gamma(\frac{\tau}{n}) = \inf \{t_n; S(t_n) \ge \frac{\tau}{n} \}

Then p(Mnxn)eτp(M_n \le x_n) \to e^{-\tau}.


(1p(Mnxn))n=(1F(tn))n=S(tn)n=(1τn)n=eτ(1 - p(M_n \le x_n))^n = (1 - F(t_n))^n = S(t_n)^n = (1 - \frac{\tau}{n})^n = e^{-\tau} (last equality by definition of exponent)

Definition 3.4: Hazard rate

Hazard rate r(t)r(t) in the same context of survival analysis is your chance of dying at the time tt.

Basically, what’s your chances to die at 64, if you’re an average person? It is the number of people, who died aged 64, to number of people, who survived by 64. In mathematical terms it is the ratio of probability density function to survival function:

r(t)=f(t)1F(t)=f(t)S(t)r(t) = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{S(t)}

Definition 3.5: Cumulative hazard rate

Cumulative hazard rate R(t)=x=tr(x)dxR(t) = \int \limits_{x=-\infty}^{t} r(x) dx is integral of hazard rate over some period of time.

Cumulative hazard rate is basically the number of times you avoided death by now. Suppose you’re a train robber in the Wild West. At your first robbery your chance of being killed (hazard rate) is 1/21/2. Then you get more experienced and at the second and third times your hazard rate is 1/31/3 and 1/41/4. If you survived 3 robberies, your cumulative hazard rate equals 1/2+1/3+1/41/2 + 1/3 + 1/4. Basically, you “deserved” more than 1 death by now and are lucky to still be alive.

Proposition 3.1. Cumulative hazard rate relation to survival function

R(t)=tf(x)1F(x)dx=t11F(x)d(1F(x))=ln(1F(t))=lnS(t)R(t) = \int \limits_{-\infty}^{t} \frac{f(x)}{1 - F(x)} dx = - \int \limits_{-\infty}^{t} \frac{1}{1 - F(x)} d(1 - F(x)) = -\ln(1 - F(t)) = -\ln S(t).

Von Mises conditions proofs

Theorem 3.1: Von Mises sufficient condition for a distribution to belong to type II (Frechet) EVD

If a distribution function FξF_{\xi} has an infinite end point xF=x_F = \infty and limtrξ(t)t=α\lim \limits_{t \to \infty} r_{\xi}(t) \cdot t = \alpha, then distribution FξF_{\xi} belongs to type II (Frechet) EVD.


Speaking informally, what we aim to show is that if hazard rate function rξ(t)r_{\xi}(t) basically behaves as a hyperbolic function αt\frac{\alpha}{t} as tt \to \infty (i.e. has a fat tail, decreasing much slower that exe^{-x}), the corresponding cumulative distribution function FξD(Frechet)F_{\xi} \in \mathcal{D}(Frechet) is in the domain of attraction D(Frechet)\mathcal{D}(Frechet) of Frechet (type II) EVD.

I will drop indices ξ\xi under rξ(t)r_{\xi}(t), Fξ(t)F_{\xi}(t) and Sξ(t)S_{\xi}(t) and will just write r(t),F(t),S(t)r(t), F(t), S(t) in context of our random variable ξ\xi in question.

We start the proof by recalling the connection between the cumulative hazard rate function R(t)R(t) and survival function S(x)S(x):

R(t)=x1x2r(t)dt=lnS(x2)lnS(x1)-R(t) = -\int \limits_{x_1}^{x_2} r(t) dt = \ln S(x_2) - \ln S(x_1)

Exponentiation of both sides gets us:

ex1x2r(t)dt=S(x2)S(x1)e^{-{\int \limits_{x_1}^{x_2} r(t) dt}} = \frac{S(x_2)}{S(x_1)}

Recalling that r(t)αtr(t) \to \frac{\alpha}{t} upon tt \to \infty by the conditions of the theorem and x1x2r(t)dtx1x2αtdt=α(lnx2lnx1)-\int \limits_{x_1}^{x_2} r(t)dt \to - \int \limits_{x_1}^{x_2} \frac{\alpha}{t} dt = - \alpha \cdot (\ln x_2 - \ln x_1):

eα(lnx2lnx1)=S(x2)S(x1)e^{-\alpha \cdot (\ln x_2 - \ln x_1)} = \frac{S(x_2)}{S(x_1)}

Now take x1=γ(n)x_1 = \gamma(n) (i.e. such a point in time, where survival function S(x1)=S(γ(n))=1/nS(x_1) = S(\gamma(n)) = 1/n, we just experessed this through the tail quantile function γ(n)\gamma(n)) and x2=xx1=xγ(n)x_2 = x \cdot x_1 = x \cdot \gamma(n) and substitute it into the previous line:

eα(ln(xγ(n))lnγ(n))=S(xγ(n))S(γ(n))e^{-\alpha \cdot (\ln (x \cdot \gamma(n)) - \ln \gamma(n))} = \frac{S(x \gamma(n))}{S(\gamma(n))}

eα(lnx+lnγ(n)lnγ(n))=S(xγ(n))1ne^{-\alpha \cdot (\ln x + \ln \gamma(n) - \ln \gamma(n))} = \frac{S(x \gamma(n))}{\frac{1}{n}}

e(lnx)α=nS(xγ(n))e^{(\ln x)^{-\alpha}} = n S(x \gamma(n))

xαn=S(xγ(n))=1F(xγ(n))\frac{ x^{-\alpha} } { n } = S(x \gamma(n)) = 1 - F(x \gamma(n)) and F(xγ(n))=1xαnF(x \gamma(n)) = 1 - \frac{ x^{-\alpha} }{n}

In other words p(ξixγ(n))=1xαnp(\xi_i \le x \gamma(n)) = 1 - \frac{ x^{-\alpha} }{n} or p(maxξixγ(n))=(1xαn)n=exαp(\max \xi_i \le x \gamma(n)) = (1 - \frac{ x^{-\alpha} }{n})^n = e^{-x^{-\alpha}} or p(maxξiγ(n)x)=(1xαn)n=exαp(\max \frac{\xi_i}{ \gamma(n) } \le x ) = (1 - \frac{ x^{-\alpha} }{n})^n = e^{-x^{-\alpha}}.

We’ve just shown that a random variable anξ+bna_n \xi + b_n converges to Frechet Type II EVD, where an=γ(n)a_n = \gamma(n) and bn=0b_n = 0.

Theorem 3.2: Von Mises sufficient condition for a distribution to belong to type III (Inverse Weibull) EVD

If a distribution function FξF_{\xi} has a finite end point xFx_F \le \infty and limxxF(xFx)r(x)=α\lim \limits_{x \to x_F} (x_F - x) r(x) = \alpha, then distribution FξF_{\xi} belongs to type III (Inverse Weibull).


If our original random variable ξ\xi had a finite upper end xFx_F, let us consider a derived random variable η=1xFξ\eta = \frac{1}{x_F - \xi}.

η\eta approaches ++\infty as ξ\xi approaches upper end xFx_F and approached 0+0+ as ξ\xi approaches -\infty.

Let us look at the connection between c.d.f.s of η\eta and ξ\xi:

Fη(x)=p(ηx)=p(1xFξx)=p(1x(xFξ))=p(ξxF1x)=Fξ(xF1x)F_{\eta}(x) = p(\eta \le x) = p(\frac{1}{x_F - \xi} \le x) = p(\frac{1}{x} \le (x_F - \xi)) = p(\xi \le x_F - \frac{1}{x}) = F_{\xi}( x_F - \frac{1}{x} ).

Basically, with η\eta we created a mapping of ξ\xi onto a {0,+}\{0, +\infty\} domain. Suppose that random variable η\eta fits the conditions of Theorem 3.1:

xFη(x)1Fη(x)=xFξ(xF1x)1x21Fξ(xF1x)xα\frac{x F'_{\eta}(x)}{ 1 - F_{\eta}(x) } = \frac{x F'_{\xi}(x_F - \frac{1}{x}) \frac{1}{x^2} }{1 - F_{\xi}(x_F - \frac{1}{x})} \xrightarrow{x \to \infty} \alpha

Denote y=xF1xy = x_F - \frac{1}{x}, note that 1x=xFy\frac{1}{x} = x_F - y and substitute this into the previous result:

(xFy)Fξ(y)1Fξ(y)\frac{ (x_F -y) \cdot F'_{\xi}(y) }{1 - F_{\xi}(y)}

We came to the expression in the conditions of our theorem exactly, hence, (xFy)Fξ(y)1Fξ(y)yxFα \frac{ (x_F - y) \cdot F'_{\xi}(y) }{1 - F_{\xi}(y)} \xrightarrow{y \to x_F} \alpha.

I.e. if and only if the conditions of this theorem are satisfied, η\eta is in the domain of attraction of Type II.

Theorem 3.3: Von Mises sufficient condition for a distribution to belong to type I (Gumbel) EVD

If a distribution function FξF_{\xi}