
Intro to the Extreme Value Theory and Extreme Value Distribution

April 30, 2023 167 min read

Quite often in mathematical statistics I run into Extreme Value Distribution - an analogue of Central Limit Theorem, which describes the distribution of maximum/minimum, observed in a series of i.i.d random variable tosses. This is an introductory text with the basic concepts and proofs of results from extreme value theory, such as Generalized Extreme Value and Pareto distributions, Fisher-Tippett-Gnedenko theorem, von Mises conditions, Pickands-Balkema-de Haan theorem and their applications.

Contents:

  1. Problem statement and Generalized Extreme Value distribution
    • Type I: Gumbel distribution
    • Type II: Frechet distribution
    • Type III: Inverse Weibull distribution
  2. Fisher-Tippett-Gnedenko theorem
    • Examples of convergence
    • General approach: max-stable distributions as invariants/fixed points/attractors and EVD types as equivalence classes
    • Khinchin’s theorem (Law of Convergence of Types)
    • Necessary conditions of maximium stability
    • Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)
    • Distributions not in domains of attraction of any maximum-stable distributions
  3. Von Mises sufficient conditions for a distribution to belong to a type I, II or III
    • Pre-requisites from survival analysis
    • Von Mises conditions proof
    • Generalizations of von Mises condition for Type I EVD: auxiliary function and von Mises function
  4. Necessary and sufficient conditions for a distribution to belong to a type I, II or III
    • Pre-requisites from Karamata’s theory of slow/regular/Г-/П- variation
    • Necessary and sufficient conditions of convergence to Types II or III EVD
    • Necessary and sufficient conditions of convergence to Type I EVD
  5. Residual life time
    • Generalized Pareto distribution
    • Residual life time problem
    • Pickands-Balkema-de Haan theorem (a.k.a. Second Extreme Value Theorem)
  6. Order statistics and parameter estimation
    • Order statistics
    • Hill’s estimator
    • Pickands’ estimator
    • Other estimators
  7. Summary and examples of practical application
    • Examples of Type I Gumbel distribution
    • Examples of Type II Frechet distribution
    • Examples of Type III Inverse Weibull distribution
  8. Concluding remarks

1. Problem statement and Generalized Extreme Value distribution

One of the most famous results in probabilities is Central Limit Theorem, which claims that sum of nn \to \infty i.i.d. random variables ξi\xi_i after centering and normalizing converges to Gaussian distribution.

Now, what if we ask a similar question about maximum of those nn \to \infty i.i.d. random variables instead of sum? Does it converge to any distribution?

It turns out that the answer depends on the properties of the distribution of \xi_i, but only mildly. Whenever the (properly centered and normalized) maximum of n i.i.d. random variables \xi_i converges to a non-degenerate distribution, that limiting distribution is:

G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}})

This distribution is called Generalized Extreme Value Distribution. Depending on the coefficient γ\gamma it can take one of three specific forms:

Type I: Gumbel distribution

If \gamma \to 0, we can write k = \frac{1}{\gamma} \to \infty. Then the generalized EVD converges to a doubly-exponential distribution (sometimes called the double-exponential law), using the limit definitions e = \lim \limits_{k \to \infty} (1 + \frac{1}{k})^k and e^x = \lim \limits_{k \to \infty} (1 + \frac{x}{k})^k:

G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}}) = exp(-(1 + \frac{1}{k} x)^{-k}) \to exp(-e^{-x}).

This is the Gumbel distribution. It comes up in various areas, e.g. in bioinformatics, where it describes the distribution of the longest run of successes in n experiments of tossing a coin 100 times each.

It is often parametrized by scale and location parameters. I will keep it centered here, but will add a scale parameter \lambda:

F(x) = e^{-e^{-\frac{x}{\lambda}}}, or, in a more intuitive notation, F(x) = \frac{1}{\sqrt[e^{x/\lambda}]{e}}.

It is straightforward to derive the probability density function f(x) from here:

f(x) = \frac{\partial F}{\partial x} = -e^{-\frac{x}{\lambda}} \cdot (-\frac{1}{\lambda}) \cdot e^{-e^{-\frac{x}{\lambda}}} = \frac{1}{\lambda} e^{- \frac{x}{\lambda} - e^{-\frac{x}{\lambda}}}.

import numpy as np
import matplotlib.pyplot as plt


scale = 1

# Generate x values from -20 to 20 with a step size of 0.1
x = np.arange(-20, 20, 0.1)

# Calculate cdf and pdf values
gumbel_cdf = np.exp(-np.exp(-(x / scale)))
gumbel_pdf = (1 / scale) * np.exp(-(x / scale + np.exp(-(x / scale))))

# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)

# Plot cdf
ax.plot(x, gumbel_cdf, label='cdf')

# Plot pdf
ax.plot(x, gumbel_pdf, label='pdf')

# Set up the legend
ax.legend()

# Set up the labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Plot of Gumbel pdf and cdf')

# Display the plot
plt.show()

Gumbel distribution plot

Gumbel distribution plot, scale = 1.

Type II: Frechet distribution

If \gamma > 0, let us denote k = \frac{1}{\gamma} (k > 0) and y = \lambda \cdot (1 + \gamma x), where k is called the shape parameter and \lambda the scale parameter. Then the distribution takes the form:

G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}}) = exp(-(\frac{y}{\lambda})^{-k}).

To make it more intuitive, I'll re-write the cdf in the following way: F(x) = \frac{1}{e^{(\frac{\lambda}{x})^k}}.

This is the Frechet distribution. It arises when the tail of the original cumulative distribution function F_{\xi}(x) is heavy, e.g. when it is a Pareto distribution.

Let us derive the probability density function for it:

f(x) = \frac{\partial F}{\partial x} = - (\frac{x}{\lambda})^{-k - 1} \cdot (-k) \cdot \frac{1}{\lambda} \cdot e^{-(\frac{x}{\lambda})^{-k}} = \frac{k}{\lambda} \cdot (\frac{x}{\lambda})^{-k-1} \cdot e^{-(\frac{x}{\lambda})^{-k}}.

Here is the plot:

import numpy as np
import matplotlib.pyplot as plt


shape = 2  # k
scale = 2  # lambda

# Generate x values from 0.1 to 20 with a step size of 0.1 (avoiding x = 0)
x = np.arange(0.1, 20, 0.1)

# Calculate cdf and pdf values
frechet_cdf = np.exp(-(scale / x) ** shape)
frechet_pdf = (shape / scale) * ((scale / x) ** (shape + 1)) * np.exp(-((scale / x) ** shape))

# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)

# Plot cdf
ax.plot(x, frechet_cdf, label='cdf')

# Plot pdf
ax.plot(x, frechet_pdf, label='pdf')

# Set up the legend
ax.legend()

# Set up the labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Plot of Frechet distribution pdf and cdf')

# Display the plot
plt.show()

Frechet distribution plot

Frechet distribution plot, scale = 2, shape = 2.

Type III: Inverse Weibull distribution

If \gamma < 0, let us denote k = -\frac{1}{\gamma} (k > 0; different kinds of behaviour are observed for 0 < k < 1, k = 1 and k > 1) and y = \lambda (1 + \gamma x).

Then distribution takes the shape:

G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}}) = exp(-(\frac{y}{\lambda})^{k}).

In standardized form:

G_{\gamma}(x) = \begin{cases} exp(-(\frac{-x}{\lambda})^{k}), & x \le 0 \\ 1, & x > 0 \end{cases}

This is the Inverse Weibull distribution. Its direct counterpart, the Weibull distribution, often occurs in survival analysis because its hazard rate is a simple power of time. It also arises in mining, where it describes the mass distribution of particles of size x, and is closely connected to the Pareto distribution. We shall discuss this connection later.

The generalized extreme value distribution converges to the Inverse Weibull when the distribution of our random variable \xi is bounded from above. E.g. consider the uniform distribution \xi \sim U(0, 1). It is clear that the maximum of n uniformly distributed variables approaches 1 as n \to \infty. It turns out that the rate of this convergence is described by the Inverse Weibull distribution, as shown in the sketch below.
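Here is a minimal simulation sketch (my addition, not part of the original derivation): for \xi \sim U(0, 1) the normalized maximum n (M_n - 1) should approximately follow the Type III cdf e^{x} on x \le 0 (Inverse Weibull with k = 1, i.e. a_n = \frac{1}{n}, b_n = 1). The sample sizes and seed below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

n = 1000          # number of uniform variables per maximum
n_trials = 10000  # number of simulated maxima

# Simulate maxima of n i.i.d. U(0, 1) variables and normalize them as n * (M_n - 1)
maxima = rng.uniform(0, 1, size=(n_trials, n)).max(axis=1)
normalized = n * (maxima - 1)

# Compare the empirical cdf with the conjectured limit exp(x), x <= 0
xs = np.linspace(-6, 0, 50)
empirical_cdf = (normalized[:, None] <= xs).mean(axis=0)
limit_cdf = np.exp(xs)

print(np.max(np.abs(empirical_cdf - limit_cdf)))  # should be small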

To make it more intuitive, we can re-write the cdf as F(x) = \begin{cases} \frac{1}{e^{ (\frac{-x}{\lambda})^k }}, & x \le 0 \\ 1, & x > 0 \end{cases}.

Derive from the cumulative distribution function F(x) = exp(-(\frac{-x}{\lambda})^{k}) the probability density function:

f(x) = \frac{\partial F}{\partial x} = -(\frac{-x}{\lambda})^{k-1} \cdot \frac{-k}{\lambda} \cdot exp(-(\frac{-x}{\lambda})^{k}) = \frac{k}{\lambda} \cdot (\frac{-x}{\lambda})^{k-1} \cdot exp(-(\frac{-x}{\lambda})^{k}).

Let us draw the plot:

import numpy as np
import matplotlib.pyplot as plt


shape = 2  # k
scale = 2  # lambda

# Generate x values from -20 to 0 with a step size of 0.1
x = np.arange(-20, 0, 0.1)

# Calculate cdf and pdf values
inverse_weibull_cdf = np.exp(-(-x / scale) ** shape)
inverse_weibull_pdf = (shape / scale) * ((-x / scale) ** (shape - 1)) * np.exp(-((-x / scale) ** shape))

# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)

# Plot cdf
ax.plot(x, inverse_weibull_cdf, label='cdf')

# Plot pdf
ax.plot(x, inverse_weibull_pdf, label='pdf')

# Set up the legend
ax.legend()

# Set up the labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Plot of Inverse Weibull pdf and cdf')

# Display the plot
plt.show()

Inverse Weibull plot

Plot of Inverse Weibull distribution, shape = 2, scale = 2.


2. Fisher-Tippett-Gnedenko theorem

The Extreme Value Theorem is a series of theorems proven in the first half of the 20th century. They claim that the maximum of n i.i.d. random variables, if it converges at all after centering and normalization, converges to one of just 3 possible distributions: Gumbel, Frechet or Inverse Weibull.

Here I will lay out the outline of the proof with my comments. The proof includes introduction of several technical tools, but I will comment on their function and rationale behind each of them.

Consider a random variable M_n, which describes the distribution of the maximum of \xi_i, i \in 1..n:

p(M_n \le x) = \prod \limits_{i=1}^{n} p(\xi_i \le x) = F_{\xi}^n(x).

Similarly to the Central Limit Theorem, a convergence theorem might be applicable to the distribution of the normalized random variable M_n rather than the raw one:

p(\frac{M_n - b_n}{a_n} \le x) = p(M_n \le a_n x + b_n) = F^n(a_n x + b_n)

We aim to show that for some sequences of constants a_n and b_n

F^n(a_n x + b_n) as n \to \infty converges in distribution to some distribution G(x): F^n(a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x).

Now I will provide several examples and then informally describe the proof outline, before introducing the mathematical formalism.

Examples of convergence

Let us start with simple examples of convergence to types I and II to get a taste of how convergence to maxima works.

Example 2.1. Convergence of maximum of i.i.d. exponential random variables to Gumbel distribution

horse_and_wagon.png

I’ll never believe that a horse and a wagon age in the same way.

- Alex Comfort (as quoted by Vladimir P. Skulachev)

Consider a wagon with an exponentially-distributed lifetime. For instance, assume that the probability for a wagon to survive each year is 1/e.

Then let the random variable \xi be its lifetime, with cdf:

F_{\xi}(x) = p (\xi \le x) = 1 - e^{-x}

Given n \to \infty wagons, what is the distribution of the lifetime of the longest-living one?

p (\max \xi_i \le x) = F^n(x) = (1 - p (\xi \ge x))^n = (1 - e^{-x})^n = (1 - \frac{e^{-(x - \ln n)}}{n})^n \xrightarrow{n \to \infty} e^{-e^{-(x - \ln n)}}

We see that this is almost the standard Gumbel distribution (Type I EVD), only shifted by \ln n.

In order to get the standard one, we need to consider the shifted random variable, p (\max \xi_i \le a_n x + b_n), choosing a_n = 1, b_n = \ln n, so that we get:

F^n(x + \ln n) = (1 - e^{-(x + \ln n)})^n = (1 - \frac{e^{-x}}{n})^n \to exp(-e^{-x}).
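A small Monte Carlo sketch (my addition; the sample sizes and seed are arbitrary) of this example: the maximum of n standard exponential variables, shifted by b_n = \ln n, should be approximately Gumbel-distributed.

import numpy as np

rng = np.random.default_rng(0)

n = 1000          # number of exponential variables per maximum
n_trials = 10000  # number of simulated maxima

# Maxima of n i.i.d. Exp(1) variables, shifted by ln(n)
maxima = rng.exponential(scale=1.0, size=(n_trials, n)).max(axis=1)
shifted = maxima - np.log(n)

# Compare the empirical cdf with the Gumbel cdf exp(-exp(-x))
xs = np.linspace(-2, 6, 50)
empirical_cdf = (shifted[:, None] <= xs).mean(axis=0)
gumbel_cdf = np.exp(-np.exp(-xs))

print(np.max(np.abs(empirical_cdf - gumbel_cdf)))  # should be small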

Example 2.2. Convergence of maximum of i.i.d. Pareto random variables to Frechet distribution

Consider another example: the Pareto distribution. It describes the distribution of wealth (e.g. the fraction of people having net worth above x), sizes of cities (e.g. the fraction of cities with a population greater than x), sizes of orders on an exchange (e.g. the fraction of orders of size greater than x), or the number of Telegram channels with more than x subscribers. Here is its cdf:

F_{\xi}(x) = p (\xi \le x) = 1 - (\frac{x_0}{x})^{\alpha}

It is often convenient to depict it in log-log coordinates, because \ln (1 - F_{\xi}(x)) = \alpha (\ln x_0 - \ln x):

log_log_Pareto.png

The Pareto distribution also describes the distribution of lifetimes of memes, comedy shows etc. This phenomenon is known as the Lindy effect. With a little calculation one can show that if the expected remaining lifetime of a comedy show is proportional to the time it has already been around, its lifetime has a Pareto cdf.

Again, let us consider the distribution of the maximum lifetime of such comedy shows:

p (\max \xi_i \le x) = F^n(x) = (1 - p (\xi \ge x))^n = (1 - (\frac{x_0}{x})^{\alpha})^n = (1 - \frac{(\frac{\sqrt[\alpha]{n} x_0}{x})^{\alpha}}{n})^n = (1 - \frac{(\frac{\lambda}{x})^{\alpha}}{n})^n \to e^{-(\frac{\lambda}{x})^{\alpha}}, where \lambda = \sqrt[\alpha]{n} x_0

We see that it converges to Frechet (EVD type II) distribution.
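Again, a simulation sketch (my addition; the parameters \alpha = 2, x_0 = 1 and the sample sizes are arbitrary): maxima of Pareto variables, scaled by a_n = \sqrt[\alpha]{n} x_0, should be approximately Frechet-distributed.

import numpy as np

rng = np.random.default_rng(0)

alpha, x0 = 2.0, 1.0
n = 1000          # number of Pareto variables per maximum
n_trials = 10000  # number of simulated maxima

# Inverse-transform sampling: if U ~ U(0, 1), then x0 * U^(-1/alpha) has the Pareto tail above
u = rng.uniform(size=(n_trials, n))
maxima = (x0 * u ** (-1.0 / alpha)).max(axis=1)
normalized = maxima / (n ** (1.0 / alpha) * x0)

# Compare the empirical cdf with the Frechet cdf exp(-x^(-alpha))
xs = np.linspace(0.3, 5, 50)
empirical_cdf = (normalized[:, None] <= xs).mean(axis=0)
frechet_cdf = np.exp(-xs ** (-alpha))

print(np.max(np.abs(empirical_cdf - frechet_cdf)))  # should be small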

Example 2.3. Convergence of minimum/maximum of i.i.d. Weibull/inverse Weibull random variables to Weibull/inverse Weibull distribution

Consider a single link of a chain, the strength of which is described by a Weibull distribution F_{\xi}(x) = p(\xi \le x) = 1 - e^{-(\frac{x}{x_0})^m}:

link

In this specific application of Weibull distribution we may say that the strength of a link S(F)S(F) is the probability that it does not break under application of force FF (sorry for the duplication of symbol FF in the notation, used as both cdf and force).

Then we can show that strength of the whole chain is also described by Weibull distribution, as the chain is as strong as its weakest link:

chain

S_{chain}(F) = p(\min \limits_i \xi_i > F) = \prod \limits_{i=1}^{n} p(\xi_i > F) = e^{- \sum \limits_{i=1}^{n} (\frac{F}{F_0})^m } = e^{-n (\frac{F}{F_0})^m } = e^{-(\frac{F}{F_0 / \sqrt[m]{n}})^m}

We see that the whole chain's strength also follows a Weibull distribution, but with a smaller scale parameter F_0 / \sqrt[m]{n}:

Weibull stability

Similarly, we may consider a zipper, where strength of a zipper is the strength of the hardest-to-open element. Hence, strength of a zipper and of each element of it is described by inverse Weibull distribution:

zipper

Later on we will introduce the concept of max-stable/min-stable distributions, i.e. distributions such that the maximum/minimum of i.i.d. copies of them follows the same distribution (up to rescaling). With this example we can already see that the Weibull distribution is min-stable and the inverse Weibull distribution is max-stable.


In these examples we used the same argument: if we can pick a_n, b_n so that n \cdot (1 - F(a_n x + b_n)) \to h(x), then F^n(a_n x + b_n) = (1 - \frac{n (1 - F(a_n x + b_n))}{n})^n \to e^{-h(x)}.

It might seem that by choosing F we could make the limiting distribution of the maximum e^{-h(x)} take almost any shape.

However, it turns out that this intuition is wrong, as there are just 3 possible shapes of Extreme Value Distribution.

Let’s see, how (counter-intuitively) the maximum of i.i.d. Gaussian random variables converges to Gumbel.

Example 2.4. Convergence of maximum of i.i.d. Gaussian random variables to Gumbel distribution

Step 1. Integration in parts

Let us consider the cdf of the normal distribution F(x) = p(\xi \le x) = \frac{1}{\sqrt{2 \pi}} \int \limits_{-\infty}^{x} e^{-t^2/2} dt = 1 - p(\xi \ge x) = 1 - \frac{1}{\sqrt{2 \pi}} \int \limits_{x}^{\infty} e^{-t^2/2} dt.

Integrate by parts: \frac{1}{\sqrt{2 \pi}} \int \limits_{x}^{\infty} e^{-t^2/2} dt = \frac{1}{\sqrt{2 \pi}} \frac{e^{-\frac{x^2}{2}}}{x} - \frac{1}{\sqrt{2\pi}} \int \limits_{x}^{\infty} \frac{e^{-t^2/2}}{t^2} dt

Recall that we are interested in the maximum evaluated at a_n x + b_n, not just at x.

Hence, we want to study the behaviour of \frac{1}{\sqrt{2 \pi}} \frac{e^{-\frac{(a_n x + b_n)^2}{2}}}{a_n x + b_n} and to show that \frac{1}{\sqrt{2 \pi}} \frac{e^{-\frac{(a_n x + b_n)^2}{2}}}{a_n x + b_n} \sim \frac{e^{-x}}{n}.

Step 2. Split result in two multipliers, select ana_n

Let us use our freedom to choose ana_n and bnb_n.

First let us choose ana_n, while letting bnb_n \to \infty (we will choose a specific bnb_n at a later stage).

Let a_n = \frac{1}{b_n}:

\frac{1}{\sqrt{2 \pi}} \frac{e^{-\frac{(a_n x + b_n)^2}{2}}}{a_n x + b_n} = \frac{1}{\sqrt{2 \pi}} \frac{e^{-x^2 / 2 b_n^2} e^{-x}}{x / {b_n}^2 + 1} \cdot \frac{e^{-{b_n}^2 / 2}}{b_n}

By expanding the square of a_n x + b_n we've managed to split the expression into two factors. The first of them contains e^{-x}, which is required for convergence to the Gumbel distribution. Note that as b_n \to \infty, the first factor converges to e^{-x}:

\frac{e^{-x^2 / 2 b_n^2} e^{-x}}{x / {b_n}^2 + 1} \xrightarrow{b_n \to \infty} e^{-x}

The second factor \frac{e^{-{b_n}^2 / 2}}{b_n} will produce the \frac{1}{n} that we need in order to obtain (1 - \frac{e^{-x}}{n})^n.

Step 3. Select bnb_n

Choose b_n = F^{-1}(1 - 1/n), so that 1 - F(b_n) = \frac{1}{n} and, by the asymptotics from Step 1,

\frac{1}{\sqrt{2 \pi}} \frac{e^{-{b_n}^2 / 2}}{b_n} \approx 1 - F(F^{-1}(1 - 1/n)) = \frac{1}{n}

Hence, we end up having F(a_n x + b_n) \approx 1 - \frac{e^{-x}}{n}, where b_n = F^{-1}(1 - 1/n) and a_n = \frac{1}{b_n}.

Then p(\max \xi_i \le a_n x + b_n ) = p(\xi_i \le a_n x + b_n)^n \approx (1 - \frac{e^{-x}}{n})^n \to e^{-e^{-x}}.
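A quick numerical check of these constants (my addition; it assumes scipy is available for the Gaussian quantile function, and the sample sizes are arbitrary). Note that for the Gaussian the convergence is notoriously slow (the error decays roughly as 1/\ln n), so the agreement is only rough:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

n = 5000         # number of Gaussian variables per maximum
n_trials = 2000  # number of simulated maxima

# Normalizing constants derived above: b_n = F^{-1}(1 - 1/n), a_n = 1 / b_n
b_n = norm.ppf(1 - 1 / n)
a_n = 1 / b_n

maxima = rng.standard_normal(size=(n_trials, n)).max(axis=1)
normalized = (maxima - b_n) / a_n

# Compare the empirical cdf with the Gumbel cdf exp(-exp(-x))
xs = np.linspace(-2, 5, 50)
empirical_cdf = (normalized[:, None] <= xs).mean(axis=0)
gumbel_cdf = np.exp(-np.exp(-xs))

print(np.max(np.abs(empirical_cdf - gumbel_cdf)))  # moderately small; convergence is slow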

This is just one example of a distribution for which it is not obvious at first glance that its maximum converges to one of the 3 types of EVD, yet it does converge to Type I. We shall prove as a theorem that for i.i.d. random variables from any distribution there are just 3 possible limits for their maximum. Let's first discuss the general approach of how this could be shown, and then proceed with a formal proof.

General approach: max-stable distributions as invariants/fixed points/attractors and EVD types as equivalence classes

I assume that all three types of Extreme Value Distribution were first discovered experimentally. Later statisticians came up with a proof that EVD can converge to just one of three possible types of distributions and no other types of EVD can exist. Finally, they came up with criteria for a distribution to belong to each type.

Design of this proof is similar to many other proofs. I will outline it informally here:

Assume that, as the number n of random variables grows to infinity, the distribution of the observed maximum approaches some type of distribution. Then such a distribution type can be considered an invariant, attractor or fixed point, similar to many other mathematical problems. For instance, eigenvectors are fixed points of matrix multiplication: a matrix eigenvector, multiplied by the matrix, results in itself multiplied by a scalar. Or, no matter how many times you take the derivative of e^{kx}, you get e^{kx} multiplied by a scalar k.

Similarly, maximum-stable distributions are invariant objects. Those are distributions such that the maximum of i.i.d. variables drawn from them follows the same distribution, no matter how many more i.i.d. random variables you toss. E.g. if for one Gumbel-distributed random variable \xi we know that p_{\xi}(\frac{M_1 - b_1}{a_1} \le x) = e^{-e^{-x}}, then for n \to \infty Gumbel-distributed random variables the maximum of \xi_1 .. \xi_n is still Gumbel-distributed (after centering and normalizing by some numbers a_n, b_n): p_{M_{n}}(\frac{M_n - b_n}{a_n} \le x) = e^{-e^{-x}}.

Ok. Then after we established that there are some distributions, for which maximum of nn \to \infty centered and normalized i.i.d. variables produces a random variable with the same distribution, how do we show that all distributions converge to one of them?

We’ll use another classical mathematical tool: equivalence classes and equivalence relation. For instance, odd numbers and even numbers form two equivalence classes under operation of modulo 2. Odd numbers are equivalent to each other in terms of producing remainder 1 (e.g. 353 \sim 5, where \sim is equivalence relation of modulo 2), and even numbers are equivalent in terms of producing remainder 0.

Similarly, we will show that types of EVD form equivalence classes under the operation of finding maximum of nn \to \infty i.i.d. random variables with any distribution, and as a result all the distributions converge to one of those types. E.g. Pareto’s distribution is equivalent to Cauchy distribution under equivalence relation of convergence of maximum of nn \to \infty Pareto/Cauchy i.i.d’s to the same maximum stable type II (Frechet) EVD.

Now that I’ve laid out the plan of the proof, it is time to get into technicalities. I will formally introduce the concepts I mentioned above and prove some lemmas about their relatedness.

Definition 2.1: Max-stable cumulative distribution function

G is max-stable if for every n \in \mathbb{N} there exist a_n > 0 and b_n \in \mathbb{R} such that for all x \in \mathbb{R}: G(x) = G^n(a_n x + b_n).
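A quick sanity check of this definition (my addition): the standard Gumbel cdf G(x) = exp(-e^{-x}) is max-stable with a_n = 1, b_n = \ln n, because

G^n(x + \ln n) = exp(-n \cdot e^{-(x + \ln n)}) = exp(-n \cdot \frac{e^{-x}}{n}) = exp(-e^{-x}) = G(x).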

Definition 2.2: Domain of attraction

If F is a cdf, then F is in the domain of attraction (for maxima) of G, written F \in \mathcal{D}(G), when there exist sequences \{a_n\} \subset \mathbb{R}^+ and \{b_n\} \subset \mathbb{R} such that F^n (a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x).

Definition 2.3: Type of convergence

If G^*(x) is another non-degenerate cdf, we say that G and G^* have the same type if there exist a > 0 and b \in \mathbb{R} such that for every x \in \mathbb{R}: G^*(ax + b) = G(x).

Khinchin’s theorem (Law of Convergence of Types)

Khinchin

Aleksandr Yakovlevich Khinchin

Lemma 2.1: Khinchin’s theorem (law of Convergence of Types)

Suppose that we have a sequence of distribution functions \{F_n\} (e.g. the distributions of the maximum of a random variable \xi_i in n experiments).

Let those distribution functions converge to a certain distribution G(x) as n \to \infty: F_n(a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x), so that we have two sequences of constants \{a_n\}, \{b_n\}.

Suppose there is another non-degenerate distribution function H(x) and a different pair of sequences \{ \alpha_n \}, \{\beta_n \} such that F_n(\alpha_n x + \beta_n) \xrightarrow[n \to \infty]{w} H(x).

Then H(x) = G(Ax + B), where A = \lim \limits_{n \to \infty} \frac{\alpha_n}{a_n} and B = \lim \limits_{n \to \infty} \frac{\beta_n - b_n}{a_n}.

Proof:

Consider two distribution functions G(x)G(x) and H(x)H(x), such that for every xx: y=F(ax+b)y = F(ax+b) and y=F(αx+β)y = F(\alpha x + \beta).

Denote y=F(ax+b)G(x)y = F(ax + b) \to G(x). Then F1(y)=ax+bF^{-1}(y) = ax + b and x=F1(y)baG1(y)x = \frac{F^{-1}(y) - b}{a} \to G^{-1}(y).

Similarly y=F(αx+β)H(x)y = F(\alpha x + \beta) \to H(x) and F1(y)=αx+βF^{-1}(y) = \alpha x + \beta and x=F1(y)βαH1(y)x = \frac{F^{-1}(y) - \beta}{\alpha} \to H^{-1}(y).

Now choose two points: x1x_1, corresponding to y1y_1, and x2x_2, corresponding to y2y_2 and subtract x1x_1 and x2x_2 from each other:

x1x2=F1(y1)F1(y2)aG1(y1)G1(y2)x_1 - x_2 = \frac{F^{-1}(y_1) - F^{-1}(y_2)}{a} \to G^{-1}(y_1) - G^{-1}(y_2)

Apply the same for H1H^{-1}:

x1x2=F1(y1)F1(y2)αH1(y1)H1(y2)x_1 - x_2 = \frac{F^{-1}(y_1) - F^{-1}(y_2)}{\alpha} \to H^{-1}(y_1) - H^{-1}(y_2)

Which results in G1(y1)G1(y2)H1(y1)H1(y2)αa=A\frac{G^{-1}(y_1) - G^{-1}(y_2)}{H^{-1}(y_1) - H^{-1}(y_2)} \to \frac{\alpha}{a} = A.

Substitute α=Aa\alpha = A \cdot a into H1(y)x=F1(y)βAaH^{-1}(y) \to x = \frac{F^{-1}(y) - \beta}{A \cdot a} and AH1(y)Ax=F1(y)βaA \cdot H^{-1}(y) \to A \cdot x = \frac{F^{-1}(y) - \beta}{a}.

On the other hand we recall that G1(y)x=F1(y)baG^{-1}(y) \to x = \frac{F^{-1}(y) - b}{a}. Subtracting these, we get: AH1(y)G1(y)F1(y)βaF1(y)ba=bβaA \cdot H^{-1}(y) - G^{-1}(y) \to \frac{F^{-1}(y) - \beta}{a} - \frac{F^{-1}(y) - b}{a} = \frac{b - \beta}{a} or βba=BG1(y)AH1(y)\frac{\beta - b}{a} = B \to G^{-1}(y) - A \cdot H^{-1}(y).

Hence, G1(y)AH1(y)+BG^{-1}(y) \to A \cdot H^{-1}(y) + B.

Lemma 2.2: Necessary condition of maximum-stability

Given G a non-degenerate cdf:

  1. G is max-stable if and only if there exists a sequence \{F_n\} of cdf's and sequences \{a_n\} \subset \mathbb{R}^+, \{b_n\} \subset \mathbb{R} such that for all k \in \mathbb{N}: F_n(a_{nk} x + b_{nk}) \xrightarrow[n \to \infty]{w} G^{1/k}(x)

  2. \mathcal{D}(G) \neq \emptyset if and only if G is max-stable. In that case, G \in \mathcal{D}(G).

Proof:

Proposition 1, direct statement: if G is max-stable, then such a sequence \{F_n\} exists

If G is max-stable, then by definition for every n \in \mathbb{N} there exist a_n, b_n such that G^{n}(a_n x + b_n) = G(x).

Define F_n = G^n. Then F_n(a_{nk} x + b_{nk}) = G^{n}(a_{nk} x + b_{nk}) = (G^{nk}(a_{nk} x + b_{nk}))^{1/k} = G^{1/k}(x). We arrive at the direct statement.

Proposition 1, reverse statement: if such a sequence \{F_n\} exists, then G is max-stable

Let us prove the reverse statement: suppose that sequences \{F_n\}, \{a_n\}, \{b_n\} exist such that for all k \in \mathbb{N}:

F_n(a_{nk}x + b_{nk}) \xrightarrow[n \to \infty]{w} G^{1/k}(x)

Then consider k=1k=1 and k=2k=2:

F_n(a_{n}x + b_{n}) \xrightarrow[n \to \infty]{w} G(x) and F_n(a_{2n}x + b_{2n}) \xrightarrow[n \to \infty]{w} G^{1/2}(x)

By Khinchin's lemma there exist \alpha_2, \beta_2 such that G(\alpha_2 x + \beta_2) = G^{1/2}(x).

Similarly, for every other k: G(\alpha_k x + \beta_k) = G^{1/k}(x), or G^k(\alpha_k x + \beta_k) = G(x), which is the definition of max-stability.

Proposition 2 direct statement:

The proof is self-evident: if G is max-stable, then G^n(a_n x + b_n) = G(x) for every n, so F = G satisfies F^n(a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x), and G \in \mathcal{D}(G) by definition.

Proposition 2 reverse statement:

Assume F \in \mathcal{D}(G), i.e. F^n (a_n x + b_n) \xrightarrow[n \to \infty]{w} G(x).

For all k \in \mathbb{N} we also have F^{nk} (a_{nk} x + b_{nk}) \xrightarrow[n \to \infty]{w} G(x).

Hence, F^{n} (a_{nk} x + b_{nk}) \xrightarrow[n \to \infty]{w} G^{1/k}(x)

So the sequence F_n = F^n satisfies the conditions of Proposition 1, proving that G is max-stable.

Corollary 2.1:

Let G be a max-stable cdf. Then there exist functions a(s) > 0 and b(s) such that for all x \in \mathbb{R} and all s > 0: G^s(a(s)x + b(s)) = G(x).

The corollary follows from the lemma by inverting the indices, s = \frac{1}{k}.

Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)

Sir Ronald Aylmer Fisher, Leonard Henry Caleb Tippett and Boris Vladimirovich Gnedenko

Theorem 2.1: Fisher-Tippett-Gnedenko theorem (Extreme Value Theorem)

Let \xi_i be a sequence of i.i.d. random variables.

If there exist constants a_n > 0, b_n \in \mathbb{R} and some non-degenerate cumulative distribution function G such that \frac{M_n - b_n}{a_n} \xrightarrow[n \to \infty]{w} G, then G is of one of the following types:

(Type I) Gumbel: G(x) = exp(-e^{-x}), x \in \mathbb{R},

(Type II) Frechet: G(x) = exp(-x^{-\alpha}), x \ge 0, \alpha > 0,

(Type III) Inverse Weibull: G(x) = exp(-(-x)^{\alpha}), x \le 0, \alpha > 0.

Proof

Here we give the proof of Fisher-Tippett-Gnedenko theorem without introducing any additional pre-requisites and intermediate constructs. Because of that it might look like black magic now. It is not clear, how anyone could’ve come up with this proof.

However, later on in parts 3 and 4 we will give the definitions of tail quantile function and tools from Karamata’s theory of slow/regular variation.

If you revisit this proof afterwards, you will notice that we’re making use of those tools, without naming them explicitly.

Step 1.

Consider double negative logarithm of max-stable distribution G(a(s)x+b(s))s=G(x)G(a(s)x + b(s))^s = G(x).

ln(ln(G(a(s)x+b(s))s))=ln(sln(G(a(s)x+b(s))))=ln(ln(G(a(s)x+b(s))))lns=ln(lnG(x))-\ln(-\ln(G(a(s)x + b(s))^{s})) = -\ln( -s \cdot \ln(G(a(s)x + b(s)))) = -\ln(-\ln(G(a(s)x + b(s)))) - \ln s = -\ln(-\ln G(x))

Step 2.

Denote ϕ(x)=ln(ln(G(x)))\phi(x) = -\ln(-\ln(G(x))). Then from previous ϕ(a(s)x+b(s))lns=ϕ(x)\phi(a(s)x + b(s)) - \ln s = \phi(x).

Step 3.

Denote y=ϕ(x)y = \phi(x). Apply ϕ1\phi^{-1} to both sides. We get: ϕ1(ϕ(a(s)x+b(s)))=y+lns\phi^{-1}(\phi(a(s)x + b(s))) = y + \ln s.

a(s)x+b(s)=ϕ1(y+lns)a(s)x + b(s) = \phi^{-1}(y + \ln s)

a(s)ϕ1(y)+b(s)=ϕ1(y+lns)a(s) \phi^{-1}(y) + b(s) = \phi^{-1}(y + \ln s)

ϕ1(y)=ϕ1(y+lns)b(s)a(s)\phi^{-1}(y) = \frac{\phi^{-1}(y + \ln s) - b(s)}{a(s)}

Step 4.

Note that ϕ1(0)=ϕ1(lns)b(s)a(s)\phi^{-1}(0) = \frac{\phi^{-1}(\ln s) - b(s)}{a(s)}. Subtract ϕ1(0)\phi^{-1}(0) from both sides:

ϕ1(y)ϕ1(0)=ϕ1(y+lns)b(s)a(s)ϕ1(lns)b(s)a(s)=ϕ1(y+lns)ϕ1(lns)a(s)\phi^{-1}(y) - \phi^{-1}(0) = \frac{\phi^{-1}(y + \ln s) - b(s)}{a(s)} - \frac{\phi^{-1}(\ln s) - b(s)}{a(s)} = \frac{\phi^{-1}(y + \ln s) - \phi^{-1}(\ln s)}{a(s)}

Step 5.

Substitute variables: ψ1(y)=ϕ1(y)ϕ1(0)\psi^{-1}(y) = \phi^{-1}(y) - \phi^{-1}(0), z=lnsz = \ln s, a~(z)=a(ez)\tilde a(z) = a(e^z). Then:

ψ1(y)=ϕ1(y)ϕ1(0)=ϕ1(y+lns)ϕ1(lns)a(s)=ψ1(y+z)ψ1(z)a~(z)\psi^{-1}(y) = \phi^{-1}(y) - \phi^{-1}(0) = \frac{\phi^{-1}(y + \ln s) - \phi^{-1}(\ln s)}{a(s)} = \frac{\psi^{-1}(y + z) - \psi^{-1}(z)}{\tilde a(z)}

ψ1(y+z)ψ1(z)=ψ1(y)a~(z)\psi^{-1}(y + z) - \psi^{-1}(z) = \psi^{-1}(y) \tilde a(z)

Step 6.

We can swap y and z in the previous equation, setting y = z and z = y:

ψ1(y+z)ψ1(y)=ψ1(z)a~(y)\psi^{-1}(y + z) - \psi^{-1}(y) = \psi^{-1}(z) \tilde a(y)

After that subtract ψ1(y+z)ψ1(z)=ψ1(y)a~(z)\psi^{-1}(y + z) - \psi^{-1}(z) = \psi^{-1}(y) \tilde a(z) from ψ1(y+z)ψ1(y)=ψ1(z)a~(y)\psi^{-1}(y + z) - \psi^{-1}(y) = \psi^{-1}(z) \tilde a(y):

ψ1(z)ψ1(y)=ψ1(z)a~(y)ψ1(y)a~(z)\psi^{-1}(z) - \psi^{-1}(y) = \psi^{-1}(z) \tilde a(y) - \psi^{-1}(y) \tilde a(z)

ψ1(z)(1a~(y))=ψ1(y)(1a~(z))\psi^{-1}(z) (1 - \tilde a(y)) = \psi^{-1}(y) (1 - \tilde a(z))

Here we consider two cases.

Step 7a.

If a~(z)=1\tilde{a}(z) = 1, previous equation leads us to 0=00 = 0. But then let’s substitute a~(z)=1\tilde{a}(z) = 1 into the result of step 5:

ψ1(y+z)=ψ1(y)+ψ1(z)\psi^{-1}(y + z) = \psi^{-1}(y) + \psi^{-1}(z)

This means that \psi^{-1}(y) = \rho y for some \rho > 0 (Cauchy's functional equation with monotonicity), and denoting \nu = \phi^{-1}(0), we get:

\rho y = \psi^{-1}(y) = \phi^{-1}(y) - \phi^{-1}(0) = \phi^{-1}(y) - \nu

\phi^{-1}(y) = \nu + \rho y

x = \phi^{-1}(\phi(x)) = \nu - \rho \ln(-\ln G(x))

G(x)=exp(exνρ)G(x) = exp(-e^{-\frac{x - \nu}{\rho}}), which is Gumbel (Type I) EVD.

Step 7b.

If a~(z)1\tilde{a}(z) \ne 1:

ψ1(y)=ψ1(z)(1a~(z))(1a~(y))=c(1a~(y))\psi^{-1}(y) = \frac{ \psi^{-1}(z) }{ (1 - \tilde a(z)) } (1 - \tilde a(y)) = c (1 - \tilde a(y))

Now recall that ψ1(y+z)ψ1(z)=ψ1(y)a~(z)\psi^{-1}(y + z) - \psi^{-1}(z) = \psi^{-1}(y) \tilde a(z) and substitute ψ1(y)=c(1a~(y))\psi^{-1}(y) = c (1 - \tilde a(y)) there:

c(1a~(y+z))c(1a~(y))=c(1a~(y))a~(z)c (1 - \tilde{a}(y + z)) - c (1 - \tilde{a}(y)) = c (1 - \tilde{a}(y)) \tilde a(z)

This leads us to the equation \tilde{a}(z + y) = \tilde{a}(y) \tilde{a}(z), which, for monotone \tilde{a}(y), has the solution \tilde{a}(y) = e^{\rho y}. Hence:

ψ1(y)=c(1eρy)=ϕ1(y)ϕ1(0)\psi^{-1}(y) = c (1 - e^{\rho y}) = \phi^{-1}(y) - \phi^{-1}(0)

ϕ1(y)=ν+c(1eρy)\phi^{-1}(y) = \nu + c (1 - e^{\rho y}), where ν=ϕ1(0)\nu = \phi^{-1}(0).

Now recall that ϕ(x)=ln(ln(G(x)))\phi(x) = -\ln(-\ln(G(x))), and we get: x=ϕ1(ϕ(x))=ν+c(1eρln(ln(G(x))))x = \phi^{-1}(\phi(x)) = \nu + c (1 - e^{-\rho \ln(-\ln(G(x)))}). Hence:

xνc=1(lnG(x))ρ\frac{x - \nu}{c} = 1 - (-\ln G(x))^{-\rho}

(lnG(x))ρ=1xνc(-\ln G(x))^{-\rho} = 1 - \frac{x - \nu}{c}

lnG(x)=(1xνc)1ρ-\ln G(x) = (1 - \frac{x - \nu}{c})^{-\frac{1}{\rho}}

G(x) = e^{-(1 - \frac{x - \nu}{c})^{-\frac{1}{\rho}}}, which is either a Frechet (Type II) or an Inverse Weibull (Type III) EVD.

Distributions not in domains of attraction of any maximum-stable distributions

We've shown that if the maximum of n i.i.d. random variables of a given distribution converges to a maximum-stable distribution at all, that limit is of one of the 3 described types. However, the maximum might not converge to any max-stable distribution.

For instance, the Poisson distribution and the Geometric distribution do not converge to any type of Extreme Value Distribution. To show this we will need many more tools in our toolbox; the corresponding theorem will be proven at the end of section 4.


3. Von Mises sufficient conditions for a distribution to belong to a type I, II or III

Richard von Mises

Richard von Mises

The Fisher-Tippett-Gnedenko theorem is an important theoretical result, but it does not provide an answer to the basic question: what type of EVD does our distribution function FF belong to?

Fortunately, there are two sets of criteria that let us determine the domain of attraction of FF. First, there are von Mises conditions, which are sufficient, but not necessary. Still, they are more intuitive and give a good insight into what kinds of distributions converge to what types of EVD and why. Second, there are general sufficient and necessary conditions. Proving them is a much more technical task and requires some extra preliminaries.

We will start with von Mises conditions, postulated by Richard von Mises in 1936, 7 years before Fisher-Tippett-Gnedenko theorem was proved by Boris Gnedenko in 1943. Von Mises conditions are formulated in terms of survival analysis. We shall introduce some basic notions from survival analysis first.

Pre-requisites from survival analysis

Definition 3.1: Survival function

The survival function S(t) is the complement of the cumulative distribution function F(t): S(t) = 1 - F(t).

Basically, if our random variable's value represents human longevity, the cumulative distribution function F(t) = p(\xi \le t) = \int \limits_{-\infty}^{t} f(x) dx represents the fraction of people who die by the time t.

The survival function S(t) = p(\xi \ge t) = 1 - p(\xi \le t) = 1 - F(t), on the contrary, is the fraction of people who are still alive at time t.

Proposition 3.1: the integral of the survival function equals the average life expectancy

Basically, rotate the survival function plot by 90 degrees to see that its area is the expectation of the lifetime (swap the x and y axes and it becomes obvious).
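For a non-negative random variable this can also be written out directly (my addition; it is the standard Fubini argument behind the "rotate the plot" picture):

\mathbb{E}[\xi] = \int \limits_{0}^{\infty} x f(x) dx = \int \limits_{0}^{\infty} \int \limits_{0}^{x} f(x) dt \, dx = \int \limits_{0}^{\infty} \int \limits_{t}^{\infty} f(x) dx \, dt = \int \limits_{0}^{\infty} S(t) dt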

Definition 3.2: Survival function end point

We shall denote the end point of the survival function x_F = \sup \{ x: F(x) < 1\}. It is also sometimes denoted \omega(F).

Basically, x_F is the smallest point x where the survival function S(x) becomes exactly 0. For instance, if we're studying human survival and there are known survivors at the age of 128, but everybody dies by the age of 129, then x_F = 129.

If there is no such limit (e.g. the population dies out exponentially S(x)=exS(x) = e^{-x} or polynomially S(x)=1xS(x) = \frac{1}{x}), we say that xF=x_F = \infty.

Definition 3.3: Tail quantile function

The tail quantile function of n is the smallest time t when the fraction of survivors becomes smaller than \frac{1}{n}:

\gamma(n) = \inf \{ t: F(t) \ge 1 - \frac{1}{n} \} = \inf \{ t: S(t) \le \frac{1}{n} \}

For instance, the tail quantile function of 10 is the time when only 1/10 of the population is still alive.
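For example (my addition), for the standard exponential distribution S(t) = e^{-t}, so \gamma(n) = \inf \{ t: e^{-t} \le \frac{1}{n} \} = \ln n, which is exactly the shift b_n = \ln n that appeared in Example 2.1.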

Lemma 3.1: convergence of tail quantile function to exponent

Consider a sequence of points \{ t_n \} such that S(t_n) = \frac{\tau}{n} for some fixed \tau > 0 (i.e. t_n = \gamma(\frac{n}{\tau}) in terms of the tail quantile function).

Then p(M_n \le t_n) \to e^{-\tau}.

Proof:

p(M_n \le t_n) = F(t_n)^n = (1 - S(t_n))^n = (1 - \frac{\tau}{n})^n \to e^{-\tau} (the last step by the definition of the exponent)

Definition 3.4: Hazard rate

Hazard rate r(t)r(t) in the same context of survival analysis is your chance of dying at the time tt.

Basically, what are your chances of dying at 64 if you're an average person? It is the ratio of the number of people who died at age 64 to the number of people who survived until 64. In mathematical terms it is the ratio of the probability density function to the survival function:

r(t) = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{S(t)}

Definition 3.5: Cumulative hazard rate

Cumulative hazard rate R(t)=x=tr(x)dxR(t) = \int \limits_{x=-\infty}^{t} r(x) dx is integral of hazard rate over some period of time.

Cumulative hazard rate is basically the number of times you avoided death by now. Suppose you’re a train robber in the Wild West. At your first robbery your chance of being killed (hazard rate) is 1/21/2. Then you get more experienced and at the second and third times your hazard rate is 1/31/3 and 1/41/4. If you survived 3 robberies, your cumulative hazard rate equals 1/2+1/3+1/41/2 + 1/3 + 1/4. Basically, you “deserved” more than 1 death by now and are lucky to still be alive.

Proposition 3.2. Cumulative hazard rate relation to survival function

R(t) = \int \limits_{-\infty}^{t} \frac{f(x)}{1 - F(x)} dx = - \int \limits_{-\infty}^{t} \frac{1}{1 - F(x)} d(1 - F(x)) = -\ln(1 - F(t)) = -\ln S(t).
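A worked example of this relation (my addition): for the Weibull distribution with S(t) = e^{-(\frac{t}{\lambda})^k} we get R(t) = -\ln S(t) = (\frac{t}{\lambda})^k and, differentiating, r(t) = R'(t) = \frac{k}{\lambda} (\frac{t}{\lambda})^{k-1}, a power-law hazard rate.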

Von Mises conditions proofs

Theorem 3.1: Von Mises sufficient condition for a distribution to belong to type II (Frechet) EVD

If a distribution function F_{\xi} has an infinite end point x_F = \infty and \lim \limits_{t \to \infty} r_{\xi}(t) \cdot t = \alpha, then the distribution F_{\xi} belongs to the type II (Frechet) EVD domain of attraction.

Proof:

Speaking informally, what we aim to show is that if the hazard rate function r_{\xi}(t) basically behaves as a hyperbolic function \frac{\alpha}{t} as t \to \infty (i.e. the tail is fat, decreasing much more slowly than e^{-x}), then the corresponding cumulative distribution function F_{\xi} \in \mathcal{D}(Frechet) is in the domain of attraction \mathcal{D}(Frechet) of the Frechet (type II) EVD.

I will drop indices ξ\xi under rξ(t)r_{\xi}(t), Fξ(t)F_{\xi}(t) and Sξ(t)S_{\xi}(t) and will just write r(t),F(t),S(t)r(t), F(t), S(t) in context of our random variable ξ\xi in question.

We start the proof by recalling the connection between the cumulative hazard rate function R(t)R(t) and survival function S(x)S(x):

R(x1)R(x2)=x1x2r(t)dt=lnS(x2)lnS(x1)R(x_1) - R(x_2) = -\int \limits_{x_1}^{x_2} r(t) dt = \ln S(x_2) - \ln S(x_1)

Exponentiation of both sides gets us:

ex1x2r(t)dt=S(x2)S(x1)e^{-{\int \limits_{x_1}^{x_2} r(t) dt}} = \frac{S(x_2)}{S(x_1)}

Recalling that r(t)αtr(t) \to \frac{\alpha}{t} upon tt \to \infty by the conditions of the theorem and x1x2r(t)dtx1x2αtdt=α(lnx2lnx1)-\int \limits_{x_1}^{x_2} r(t)dt \to - \int \limits_{x_1}^{x_2} \frac{\alpha}{t} dt = - \alpha \cdot (\ln x_2 - \ln x_1):

eα(lnx2lnx1)=S(x2)S(x1)e^{-\alpha \cdot (\ln x_2 - \ln x_1)} = \frac{S(x_2)}{S(x_1)}

Now take x_1 = \gamma(n) (i.e. the point in time where the survival function S(x_1) = S(\gamma(n)) = 1/n; we just expressed this through the tail quantile function \gamma(n)) and x_2 = x \cdot x_1 = x \cdot \gamma(n), and substitute these into the previous line:

eα(ln(xγ(n))lnγ(n))=S(xγ(n))S(γ(n))e^{-\alpha \cdot (\ln (x \cdot \gamma(n)) - \ln \gamma(n))} = \frac{S(x \gamma(n))}{S(\gamma(n))}

eα(lnx+lnγ(n)lnγ(n))=S(xγ(n))1ne^{-\alpha \cdot (\ln x + \ln \gamma(n) - \ln \gamma(n))} = \frac{S(x \gamma(n))}{\frac{1}{n}}

e^{-\alpha \ln x} = x^{-\alpha} = n S(x \gamma(n))

xαn=S(xγ(n))=1F(xγ(n))\frac{ x^{-\alpha} } { n } = S(x \gamma(n)) = 1 - F(x \gamma(n)) and F(xγ(n))=1xαnF(x \gamma(n)) = 1 - \frac{ x^{-\alpha} }{n}

In other words p(ξixγ(n))=1xαnp(\xi_i \le x \gamma(n)) = 1 - \frac{ x^{-\alpha} }{n} or p(maxξixγ(n))=(1xαn)n=exαp(\max \xi_i \le x \gamma(n)) = (1 - \frac{ x^{-\alpha} }{n})^n = e^{-x^{-\alpha}} or p(maxξiγ(n)x)=(1xαn)n=exαp(\max \frac{\xi_i}{ \gamma(n) } \le x ) = (1 - \frac{ x^{-\alpha} }{n})^n = e^{-x^{-\alpha}}.

We've just shown that the normalized maximum \frac{M_n - b_n}{a_n} converges to the Frechet Type II EVD, where a_n = \gamma(n) and b_n = 0.
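A worked check of this condition (my addition): for the Pareto distribution with S(t) = (\frac{x_0}{t})^{\alpha} we have f(t) = \alpha x_0^{\alpha} t^{-\alpha - 1}, so

r(t) \cdot t = \frac{f(t)}{S(t)} \cdot t = \frac{\alpha x_0^{\alpha} t^{-\alpha - 1}}{x_0^{\alpha} t^{-\alpha}} \cdot t = \alpha

in agreement with Example 2.2, where Pareto maxima converged to Frechet.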

Theorem 3.2: Von Mises sufficient condition for a distribution to belong to type III (Inverse Weibull) EVD

If a distribution function F_{\xi} has a finite end point x_F < \infty and \lim \limits_{x \to x_F} (x_F - x) r(x) = \alpha, then the distribution F_{\xi} belongs to type III (Inverse Weibull).

Proof:

If our original random variable \xi has a finite upper end point x_F, let us consider a derived random variable \eta = \frac{1}{x_F - \xi}.

\eta approaches +\infty as \xi approaches the upper end point x_F, and approaches 0+ as \xi approaches -\infty.

Let us look at the connection between c.d.f.s of η\eta and ξ\xi:

Fη(x)=p(ηx)=p(1xFξx)=p(1x(xFξ))=p(ξxF1x)=Fξ(xF1x)F_{\eta}(x) = p(\eta \le x) = p(\frac{1}{x_F - \xi} \le x) = p(\frac{1}{x} \le (x_F - \xi)) = p(\xi \le x_F - \frac{1}{x}) = F_{\xi}( x_F - \frac{1}{x} ).

Basically, with \eta we created a mapping of \xi onto the (0, +\infty) domain. Let us check when \eta satisfies the condition of Theorem 3.1:

xFη(x)1Fη(x)=xFξ(xF1x)1x21Fξ(xF1x)xα\frac{x F'_{\eta}(x)}{ 1 - F_{\eta}(x) } = \frac{x F'_{\xi}(x_F - \frac{1}{x}) \frac{1}{x^2} }{1 - F_{\xi}(x_F - \frac{1}{x})} \xrightarrow{x \to \infty} \alpha

Denote y=xF1xy = x_F - \frac{1}{x}, note that 1x=xFy\frac{1}{x} = x_F - y and substitute this into the previous result:

\frac{ (x_F - y) \cdot F'_{\xi}(y) }{1 - F_{\xi}(y)}

We arrive exactly at the expression in the conditions of our theorem; hence, \frac{ (x_F - y) \cdot F'_{\xi}(y) }{1 - F_{\xi}(y)} \xrightarrow{y \to x_F} \alpha.

I.e. the condition of this theorem holds if and only if \eta satisfies the von Mises condition for Type II, so the maximum of \eta converges to Frechet and, undoing the change of variables, the maximum of \xi converges to Type III (Inverse Weibull).
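A worked check (my addition): for \xi \sim U(0, 1) we have x_F = 1, f(x) = 1 and S(x) = 1 - x, so (x_F - x) r(x) = (1 - x) \cdot \frac{1}{1 - x} = 1, i.e. \alpha = 1, matching the earlier observation that the maximum of uniform variables converges to a Type III law.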

Theorem 3.3: Von Mises sufficient condition for a distribution to belong to type I (Gumbel) EVD

If a distribution function F_{\xi} has a finite or infinite end point x_F \le \infty, then the distribution F_{\xi} belongs to the domain of attraction of Type I (Gumbel) EVD if the derivative of the hazard rate approaches zero, r'(u) \to 0 as u \to x_F, and the hazard rate approaches a positive constant: r(u) \xrightarrow{u \to x_F} const.

Speaking informally, the distribution of the maximum converges to Gumbel if the chances of death reach a plateau as u \to x_F.

NOTE: I’ve seen ~5 different formulations and proofs of this von Mises condition. This version lacks generality (I’ll discuss generalizations later in this post), but is the easier to understand in my opinion. In this proof I am loosely following the logic of Smith and Weissman.

NOTE: You may run into a popular synonymous formulation of this theorem e.g. in Leadbetter or Resnick textbooks. They claim that distribution of maximum converges to Gumbel distribution if Q(x)=F(x)(1F(x))(F(x))2xxF1Q(x) =\frac{F''(x)(1 - F(x))}{(F'(x))^2} \xrightarrow{x \to x_F} -1.

This is an equivalent condition to r(x)xxF0r'(x) \xrightarrow{x \to x_F} 0 because:

r(x)=(F(x)1F(x))=F(x)11F(x)+F(x)F(x)(1F(x))2=F(x)(1F(x))+(F(x))2(1F(x))2r'(x) = (\frac{F'(x)}{1 - F(x)})' = F''(x) \frac{1}{1 - F(x)} + F'(x) \frac{F'(x)}{(1 - F(x))^2} = \frac{F''(x) (1 - F(x)) + (F'(x))^2}{(1 - F(x))^2}

Multiply this by \frac{(1 - F(x))^2}{(F'(x))^2} and we come to r'(x) \cdot \frac{(1 - F(x))^2}{(F'(x))^2} = Q(x) + 1, so (given that r(x) approaches a positive constant) r'(x) \to 0 if and only if Q(x) \to -1, meaning that the two formulations of the theorem are equivalent.

Proof:

Step 1

We shall start the proof from the end to motivate our mathematical manipulations.

In Steps 2+ we are going to show that the von Mises condition entails \frac{S(u + x g(u))}{S(u)} \to e^{-x} as u \to x_F, where g(u) = \frac{1}{r(u)} is the inverse of the hazard rate (if the introduction of this new entity g(u) feels redundant to you now, I agree, but trust me on this for now, it will be justified later). Assume this fact proven for now and let's see how the result of the theorem follows.

As before, we are going to use the tail quantile function \gamma(n) to show that this ratio of survival functions leads to the Gumbel distribution. Take u = \gamma(n) \to x_F and substitute it into the ratio:

\frac{S(\gamma(n) + x g(\gamma(n)))}{S(\gamma(n))} = \frac{S(\gamma(n) + x g(\gamma(n)))}{\frac{1}{n}} = n S(\gamma(n) + x g(\gamma(n))) \to e^{-x}.

Hence, S(\gamma(n) + x g(\gamma(n))) = 1 - F(\gamma(n) + x g(\gamma(n))) \approx \frac{e^{-x}}{n},

F(\gamma(n) + x g(\gamma(n))) \approx 1 - \frac{e^{-x}}{n} and

F^n(\gamma(n) + x g(\gamma(n))) \approx (1 - \frac{e^{-x}}{n})^n \to e^{-e^{-x}}.

Thus, p(M_n \le \gamma(n) + x g(\gamma(n))) = p(\frac{M_n - \gamma(n)}{g(\gamma(n))} \le x) = F^n(\gamma(n) + x g(\gamma(n))) \to e^{-e^{-x}}, leading us to the desired result, the Gumbel distribution.

Step 2

Having motivated our interest in the ratio S(u+xg(u))S(u)\frac{S(u + x g(u))}{S(u)} in step 1, let us connect it to hazard rate and start showing the fact that this ratio converges to exe^{-x} as uxFu \to x_F.

Recall the connection between cumulative hazard function R(u)R(u), hazard rate r(t)r(t) and survival function S(u)S(u):

R(u)=ur(t)dt=lnS()=0lnS(u)R(u) = \int \limits_{-\infty}^{u} r(t) dt = \overbrace{\cancel{\ln S(-\infty)}}^{=0} - \ln S(u)

Hence, lnS(u+xg(u))lnS(u)=u1g(t)dtu+xg(u)1g(t)dt=uu+xg(u)1g(t)dt\ln S(u + x g(u)) - \ln S(u) = \int \limits_{-\infty}^{u} \frac{1}{g(t)} dt - \int \limits_{-\infty}^{u + x g(u)} \frac{1}{g(t)} dt = - \int \limits_{u}^{u + x g(u)} \frac{1}{g(t)} dt and

S(u+xg(u))S(u)=euu+xg(u)1g(t)dt\frac{S(u + x g(u))}{S(u)} = e^{- \int \limits_{u}^{u + x g(u)} \frac{1}{g(t)} dt}

Step 3

We need to show that euu+xg(u)1g(t)dtuxFexe^{- \int \limits_{u}^{u + x g(u)} \frac{1}{g(t)} dt} \xrightarrow{u \to x_F} e^{-x} and, hence, uu+xg(u)1g(t)dtuxFx\int \limits_{u}^{u + x g(u)} \frac{1}{g(t)} dt \xrightarrow{u \to x_F} x.

Perform a variable substitution s=tug(u)s = \frac{t - u}{g(u)}, t=u+sg(u)t = u + s g(u):

\int \limits_{t=u}^{t = u + x g(u)} \frac{1}{g(t)} dt = \int \limits_{s=\frac{u-u}{g(u)}=0}^{s=\frac{u + x g(u) - u}{g(u)}=x} \frac{1}{g(u + s g(u))} d(u + s g(u)) = \int \limits_{s=0}^{x} \frac{g(u)}{g(u + s g(u))} ds

We need to show that integrand g(u)g(u+sg(u))\frac{g(u)}{g(u + s g(u))} uniformly converges to 11 as uxFu \to x_F in order to show that integral approximately equals xx because then 0xg(u)g(u+sg(u))ds0x1ds=x\int \limits_{0}^{x} \frac{g(u)}{g(u + s g(u))} ds \to \int \limits_{0}^{x} 1 ds = x.

Uniform convergence implies that we can choose an arbitrarily small tolerance ϵ\epsilon, so that there would exist some point uϵu_{\epsilon}, such that for every value of ss inequality g(u)g(u+sg(u))1ϵ| \frac{g(u)}{g(u + s g(u))} - 1 | \le \epsilon holds for every point uu, such that uϵuxFu_{\epsilon} \le u \le x_F, and, hence, 0xg(u)g(u+sg(u))ds0x1dsϵx|\int \limits_{0}^{x} \frac{g(u)}{g(u + s g(u))} ds - \int \limits_{0}^{x} 1 ds| \le \epsilon x.

Step 4

We need to work out the approximation of g(u)g(u+sg(u))\frac{g(u)}{g(u + s g(u))} ratio.

Recall that g(u)g(u+sg(u))=r(u+sg(u))r(u)\frac{g(u)}{g(u + s g(u))} = \frac{r(u + sg(u))}{r(u)}.

Consider the Taylor series for r(u+sg(u))r(u + sg(u)) around the point uu:

r(u+sg(u))=r(u)+r(u)sg(u)+O(s2g2(u))r(u + s g(u)) = r(u) + r'(u) s g(u) + O(s^2 g^2(u))

We know that r(u) \to const (hence, g(u) \to const) and r'(u) \to 0; hence, r(u + s g(u)) \xrightarrow{u \to x_F} r(u) + \cancel{r'(u) s g(u)} + O(s^2 g^2(u)).

Second-order term O(s2g2(u))O(s^2 g^2(u)) can be kept small enough for large uu, if we consider reasonably small ss.

Then our ratio g(u)g(u+sg(u))=r(u+sg(u))r(u)r(u)r(u)1\frac{g(u)}{g(u + s g(u))} = \frac{r(u + sg(u))}{r(u)} \to \frac{r(u)}{r(u)} \to 1.

Then 0xg(u)g(u+sg(u))ds0x1ds=x\int \limits_{0}^{x} \frac{g(u)}{g(u + s g(u))} ds \approx \int \limits_{0}^{x} 1 ds = x
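A worked check of the condition (my addition): for the standard exponential distribution r(t) = \frac{f(t)}{S(t)} = \frac{e^{-t}}{e^{-t}} = 1, so r'(t) = 0 identically and r(t) \to 1 > 0; the condition holds, and indeed the maximum converges to Gumbel, as we saw directly in Example 2.1.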

Generalizations of Theorem 3.3: auxiliary function and von Mises function

As I said, there are multiple alternative formulations and proofs of von Mises conditions. Some use more generic notions of auxiliary function and von Mises function.

The general necessary and sufficient conditions in the next part of this post generalize these two notions. Hence, it makes sense to discuss them here.

Definition 3.5: Auxiliary function

In the previous proof we denoted the inverse hazard rate as g(x) = \frac{1}{r(x)}. This quantity g(x), which is called the auxiliary function, is defined in the context of the ratio \frac{S(u + x g(u))}{S(u)} and is chosen in such a way that \frac{S(u + x g(u))}{S(u)} \xrightarrow{u \to x_F} e^{-x}.

However, turns out that it is not uniquely defined. We can come up with other definitions of g(x)g(x) and convergence to Gumbel EVD would still hold.

To motivate other choices of auxiliary function let us discuss the interpretation of this ratio of survival functions. We can treat it as a conditional probability. The denominator represents the fraction of population that survived by the time uu close to the maximum lifespan. The numerator is the probability to survive for a (bit) longer period of time xg(u)x g(u), where auxiliary function g(u)g(u) can be seen as some kind of normalization constant.

Then our ratio can be interpreted as conditional probability p(ξug(u)>xξ>u)p( \frac{\xi - u}{g(u)} > x | \xi > u). Basically it is the chance to live xg(u)x g(u) years longer among those who already survived uu years.

Consider a different popular choice of auxiliary function: g(t) = \frac{\int \limits_{t}^{x_F} S(x) dx }{ S(t) }. What interpretation can we give to it?

Recall that the integral of the survival function is the average lifespan. Hence, \frac{\int \limits_{t}^{x_F} S(x) dx }{ S(t) } is the conditional expectation of the remaining lifetime among those who survived to the moment of time t (the mean residual life): the denominator is the fraction of survivors at t, while the numerator is the average remaining lifetime after t, accumulated over the whole population.
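For the standard exponential distribution the two choices of auxiliary function coincide (my addition): \frac{\int \limits_{t}^{\infty} S(x) dx}{S(t)} = \frac{\int \limits_{t}^{\infty} e^{-x} dx}{e^{-t}} = 1 = \frac{1}{r(t)}.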

Definition 3.6: von Mises function

Our proof is based on the fact that S(x) = e^{-\int \limits_{-\infty}^{x} r(u) du}.

However, it would still hold if we could represent the survival function as S(x) = c \cdot e^{-\int \limits_{-\infty}^{x} \frac{1}{g(u)} du} with any auxiliary function g, not necessarily g(u) = \frac{1}{r(u)}.

Hence, a survival function is called a von Mises function if it can be represented as S(x) = c \cdot e^{-\int \limits_{z_0}^{x} \frac{1}{g(u)} du}, where g(u) > 0 is an absolutely continuous auxiliary function with density g'(u), such that \lim \limits_{u \to x_F} g'(u) = 0, and z_0 < u < x_F (basically, z_0 is a lower end point, which does not have to be -\infty or 0).


4. Necessary and sufficient conditions for a distribution to belong to a type I, II or III

Now that we’ve introduced von Mises conditions, it is quite easy to slightly alter them in order to show that their modified versions are not only sufficient, but also necessary.

In order to do that we have to rely on mathematical apparatus of slowly varying functions and their extensions, developed by a Serbian/Yugoslavian mathematician Jovan Karamata in 1930s. Using Karamata theorems we can relatively easily prove that generalizations of von Mises conditions are not only sufficient, but necessary.

Karamata

Jovan Karamata

Pre-requisites from Karamata’s theory of slow/regular/Г-/П- variation

Here we’ll introduce the concepts of slow-varying function and regularly varying function from Karamata theory, which allow us to prove the necessary and sufficient conditions for Types II and III.

They also serve as a foundation for the concept of extended variation, which enables the proof of the sufficient conditions for Type I.

Definition 4.1: Slow-varying function

A slowly varying function l(x) is a function such that \lim \limits_{t \to \infty} \frac{l(xt)}{l(t)} = 1 for any x > 0.

Example 4.1: logarithm is a slow-varying function

lnx\ln x is a slowly varying function because limtln(xt)ln(t)=lnx+lntlnt=lnxlnt+11\lim \limits_{t \to \infty} \frac{\ln (xt)}{\ln (t)} = \frac{\ln x + \ln t}{\ln t} = \frac{\ln x}{\ln t} + 1 \to 1

Definition 4.2: Regularly varying function

A regularly varying function h(x) of index \beta is a function such that \lim \limits_{t \to \infty} \frac{h(xt)}{h(t)} = x^{\beta} for any x > 0, where \beta is a real number, sometimes called the exponent of variation or index of variation.

Regularly-varying function is basically just a generalization of slow-varying function.

Later on we will show that if the survival function of the distribution in question is regularly varying, its maximum converges to the Type III Inverse Weibull EVD if the distribution has a finite upper end point, or to the Type II Frechet EVD if its upper end point is infinite.

Example 4.2: power function is a regularly-varying function

xβx^{\beta} is a regularly varying function because limt(xt)βtβ=xβ\lim \limits_{t \to \infty} \frac{(xt)^\beta}{t^\beta} = x^\beta.

Lemma 4.1: Karamata’s theorem

Direct statement. Suppose h is a regularly varying function with index of variation \beta > -1.

Then its integral \int \limits_{0}^{x} h(u) du is a regularly varying function with index of variation \beta + 1.

Ratio statement. \lim \limits_{x \to \infty} \frac{x h(x)}{\int \limits_{0}^{x} h(u)du} = \beta + 1.

Reverse statement. Conversely, if \lim \limits_{x \to \infty} \frac{x h(x)}{\int \limits_{0}^{x} h(u)du} = \beta, then h is regularly varying with index of variation \beta - 1.

Proof of direct statement

We need to show that H(tx) = \int \limits_{0}^{tx} h(s) ds is a regularly varying function of index \beta + 1, i.e. that \frac{\int \limits_{0}^{tx} h(s) ds}{\int \limits_{0}^{t} h(s) ds} \xrightarrow{t \to \infty} x^{\beta + 1}.

Step 1. Integral of regularly varying function is unbounded

Since \beta > -1, we know that \frac{h(2t)}{h(t)} \xrightarrow{t \to \infty} 2^{\beta} > 2^{-1} = \frac{1}{2}.

We can use this fact to show that the integral \int \limits_{s_0}^{\infty} h(t)dt diverges.

Imagine that we split the t line into intervals of ever doubling length: [1, 2], [2, 4], [4, 8] etc. Starting from some point s_0, the property \frac{h(2t)}{h(t)} > \frac{1}{2} holds for all t \ge s_0.

Let us pick some N such that 2^N > s_0.

Express \int \limits_{t=2^{N+1}}^{2^{N+2}} h(t) dt through \int \limits_{t=2^N}^{2^{N+1}} h(t) dt:

\int \limits_{t=2^{N+1}}^{2^{N+2}} h(t) dt = \int \limits_{t=2^{N}}^{2^{N+1}} h(2t) \, d(2t) = 2 \int \limits_{t=2^{N}}^{2^{N+1}} h(2t) dt > 2 \int \limits_{t=2^{N}}^{2^{N+1}} h(t) \cdot \frac{1}{2} \cdot dt = \int \limits_{t=2^{N}}^{2^{N+1}} h(t) dt.

Repeating this procedure for the intervals [2^{N+2}, 2^{N+3}], [2^{N+3}, 2^{N+4}], \dots, we see that the integral over each successive dyadic interval is at least as large as the (positive) integral over [2^N, 2^{N+1}].

Hence, \int \limits_{s_0}^{\infty} h(t)dt = \infty.

Step 2. Squeeze conditions

Assuming locally uniform convergence of \frac{h(tx)}{h(t)} to x^\beta as t \to \infty: for arbitrarily small \epsilon > 0 there is N, such that for all t \ge N:

h(tx)xβh(t)1ϵ|\frac{h(tx)}{x^\beta h(t)} - 1| \le \epsilon

h(tx)xβh(t)ϵxβh(t)|h(tx) - x^\beta h(t)| \le \epsilon x^\beta h(t)

ϵxβh(t)h(tx)xβh(t)ϵxβh(t)-\epsilon x^\beta h(t) \le h(tx) - x^\beta h(t) \le \epsilon x^\beta h(t)

xβh(t)ϵxβh(t)h(tx)xβh(t)+ϵxβh(t)x^\beta h(t) - \epsilon x^\beta h(t) \le h(tx) \le x^\beta h(t) + \epsilon x^\beta h(t)

(1 - \epsilon) x^\beta h(t) \le h(tx) \le (1 + \epsilon) x^\beta h(t)

Step 3. Application of squeeze conditions

\limsup \limits_{t \to \infty} \frac{ \int \limits_0^{tx} h(s)ds }{ \int \limits_0^{t} h(s)ds } = \limsup \limits_{t \to \infty} \frac{ x \int \limits_0^{t} h(sx)ds }{ \int \limits_0^{t} h(s)ds } = \limsup \limits_{t \to \infty} \frac{ x \int \limits_N^{t} h(sx)ds }{ \int \limits_N^{t} h(s)ds } \le \limsup \limits_{t \to \infty} x^{\beta + 1} (1 + \epsilon) \frac{ \int \limits_N^{t} h(s)ds }{ \int \limits_N^{t} h(s)ds } = x^{\beta + 1} (1 + \epsilon).

Here replacing the lower integration limit 0 with N does not change the limit, because both integrals diverge by Step 1, and the inequality uses the upper squeeze bound h(sx) \le (1 + \epsilon) x^{\beta} h(s) for s \ge N. Applying the lower squeeze bound in the same way gives \liminf \limits_{t \to \infty} \frac{ \int \limits_0^{tx} h(s)ds }{ \int \limits_0^{t} h(s)ds } \ge (1 - \epsilon) x^{\beta + 1}, and since \epsilon is arbitrary, \frac{ \int \limits_0^{tx} h(s)ds }{ \int \limits_0^{t} h(s)ds } \to x^{\beta + 1}.

Step 4. Special case: β=1\beta = -1

Here we must consider two cases.

If H(\infty) = \int \limits_0^\infty h(x) dx < \infty, then H(x) is a slowly varying function, as \frac{H(tx)}{H(t)} \xrightarrow{t \to \infty} 1 (or, one might say, H(x) is regularly varying with index of variation 0).

Otherwise (if H()=0h(x)dx=H(\infty) = \int \limits_0^\infty h(x) dx = \infty), the same reasoning as above applies.

Proof of ratio statement

We need to show that limxxh(x)0xh(u)du=β+1\lim \limits_{x \to \infty} \frac{x h(x)}{\int \limits_{0}^{x} h(u)du} = \beta + 1.

Step 1. Express h(x) through b(x)

Denote b(x)=xh(x)0xh(t)dtb(x) = \frac{x h(x)}{\int \limits_{0}^{x} h(t) dt}.

Express h(x)h(x) through b(x)b(x) (see the derivation in reverse statement and Karamata’s representation below):

h(x)=cb(x)xe1xb(u)uduh(x) = c \cdot \frac{ b(x) }{ x } \cdot e^{\int \limits_{1}^{x} \frac{ b(u) }{u} du}.

Step 2. Consider inverse of b(x), do a variable change

We must show b(x)β+1b(x) \to \beta + 1. Consider the lower limit of inverse of b(x)b(x):

lim infx1/b(x)=lim infx0xh(t)dtxh(x)\liminf \limits_{x \to \infty} 1 / b(x) = \liminf \limits_{x \to \infty} \frac{ \int \limits_{0}^{x} h(t) dt }{x h(x)}

Perform a variable substitution s=txs = \frac{t}{x}:

lim infx0xh(t)dtxh(x)=lim infx01h(sx)dsh(x)\liminf \limits_{x \to \infty} \frac{ \int \limits_{0}^{x} h(t) dt }{x h(x)} = \liminf \limits_{x \to \infty} \frac{ \int \limits_{0}^{1} h(sx) ds }{ h(x) }

Step 3. Application of Fatou lemma

Apply Fatou’s lemma:

\liminf \limits_{x \to \infty} \frac{ \int \limits_{0}^{1} h(sx) ds }{ h(x) } \ge \int \limits_{0}^{1} \liminf \limits_{x \to \infty} \frac{ h(sx) }{ h(x) } ds = \int \limits_{0}^{1} s^{\beta} ds = \frac{1}{ \beta + 1 }

This leads to a conclusion lim supxb(x)β+1\limsup \limits_{x \to \infty} b(x) \le \beta + 1.

Step 4. Final analysis

b(x)b(x) has the following properties:

(i) as we've just shown, \limsup \limits_{x \to \infty} b(x) \le \beta + 1, so b(x) is ultimately bounded;

(ii) b is slowly varying, since x h(x) is regularly varying with index of variation \beta + 1 and \int \limits_0^x h(s)ds is regularly varying with index \beta + 1, so their ratio b(x) is regularly varying with index 0;

(iii) b(xt)b(x)0b(xt) - b(x) \to 0 boundedly as xx \to \infty.

The last statement follows since by slow variation:

limx(b(xt)b(x))/b(x)=0\lim \limits_{x \to \infty} (b(xt) - b(x)) / b(x) = 0

and the denominator is ultimately bounded by (i).

From (iii) and dominated convergence

limx1sb(xt)b(x)tdt=0\lim \limits_{x \to \infty} \int \limits_{1}^{s} \frac{ b(xt) - b(x) }{t} dt = 0

and the left side may be rewritten to obtain

limx(1sb(xt)tdtb(x)lns)=0\lim \limits_{x \to \infty} (\int \limits_{1}^{s} \frac{ b(xt) }{t}dt - b(x) \ln s) = 0

From h(x)=cb(x)xe1xb(u)uduh(x) = c \cdot \frac{b(x) }{ x } \cdot e^{\int \limits_{1}^{x} \frac{ b(u) }{u} du} and b(x)=xh(x)0xh(s)dsb(x) = \frac{x h(x)}{ \int \limits_{0}^{x}h(s)ds }:

cexp(1xb(t)tdt)=0xh(s)dsc \cdot exp(\int \limits_1^x \frac{b(t)}{t} dt) = \int \limits_0^x h(s) ds, which is a regularly varying function with index of variation β+1\beta + 1

Apply regular variation property to regularly varying function H(x)=0xh(s)dsH(x) = \int \limits_0^x h(s) ds with index of variation β+1\beta + 1:

s^{\beta + 1} = \lim \limits_{x \to \infty} \frac{H(sx)}{H(x)}

Apply natural logarithm to both sides:

(β+1)lns=limxln0xsh(t)dt0xh(t)dt(\beta + 1) \ln s = \lim \limits_{x \to \infty} \ln \frac{ \int \limits_0^{xs} h(t) dt }{ \int \limits_0^{x} h(t) dt }

Now recall that, as we’ve show above H(x)=0xh(s)ds=cexp(1xb(t)tdt)H(x) = \int \limits_0^x h(s) ds = c \cdot exp(\int \limits_1^x \frac{b(t)}{t} dt) and substitute this for H(x)H(x):

(\beta + 1) \ln s = \lim \limits_{x \to \infty} \ln \frac{ c \cdot e^{ \int \limits_{1}^{xs} \frac{b(t)}{t} dt } }{ c \cdot e^{ \int \limits_{1}^{x} \frac{b(t)}{t} dt } } = \lim \limits_{x \to \infty} ( \int \limits_{1}^{xs} \frac{b(t)}{t} dt - \int \limits_{1}^{x} \frac{b(t)}{t} dt ) = \lim \limits_{x \to \infty} \int \limits_x^{xs} \frac{b(t)}{t} dt = \lim \limits_{x \to \infty} \int \limits_1^s \frac{b(xt)}{t} dt

and combining this with limx(1sb(xt)tdtb(x)lns)=0\lim \limits_{x \to \infty} (\int \limits_{1}^{s} \frac{ b(xt) }{t}dt - b(x) \ln s) = 0 leads to the desired conclusion that b(x)β+1b(x) \to \beta + 1.

Proof of reverse statement

We need to show that if b(x) = \frac{x h(x)}{\int \limits_{0}^{x} h(t) dt} and we know that \lim \limits_{x \to \infty} b(x) = \beta, then h is regularly varying with index of variation \beta - 1, i.e. \frac{h(xt)}{h(t)} \xrightarrow{t \to \infty} x^{\beta - 1}.

Consider the ratio b(x)x=h(x)0xh(t)dt\frac{b(x)}{x} = \frac{h(x)}{\int \limits_{0}^{x} h(t) dt} and integrate it, using the fact that integral of the right part is a logarithm:

1xb(u)udu=1xh(u)t=0uh(t)dtdu=1xdln(0uh(t)dt)=ln(0xh(u)du)ln(01h(u)du)\int \limits_{1}^{x} \frac{b(u)}{u} du = \int \limits_{1}^{x} \frac{h(u)}{\int \limits_{t=0}^{u} h(t)dt} du = \int \limits_1^{x} d \ln(\int \limits_0^{u}h(t) dt) = \ln (\int \limits_{0}^{x} h(u) du) - \ln(\int \limits_{0}^{1} h(u) du).

e1xb(u)udu=0xh(u)du/01h(u)du=xh(x)b(x)/01h(u)due^{\int \limits_{1}^{x} \frac{b(u)}{u} du} = \int \limits_{0}^{x} h(u) du / \int \limits_{0}^{1} h(u) du = \frac{ x h(x) } { b(x) } / \int \limits_0^1 h(u) du

h(x)=01h(u)dub(x)xe1xb(u)udu=01h(u)dub(x)eln(1x)e1xb(u)udu=01h(u)dub(x)eln(x)+ln(1)e1xb(u)udu=h(x) = \int \limits_0^1 h(u) du \cdot \frac{ b(x) }{x} \cdot e^{\int \limits_{1}^{x} \frac{ b(u) }{u} du} = \int \limits_0^1 h(u) du \cdot b(x) \cdot e^{\ln (\frac{1}{x}) } \cdot e^{\int \limits_{1}^{x} \frac{ b(u) }{u} du} = \int \limits_0^1 h(u) du \cdot b(x) \cdot e^{-\ln (x) + \ln(1)} \cdot e^{\int \limits_{1}^{x} \frac{ b(u) }{u} du} =

=01h(u)dub(x)e1x1udue1xb(u)udu=01h(u)dub(x)e1xb(u)1udu = \int \limits_0^1 h(u) du \cdot b(x) \cdot e^{-\int \limits_1^x \frac{1}{u} du} \cdot e^{\int \limits_{1}^{x} \frac{ b(u) }{u} du} = \int \limits_0^1 h(u) du \cdot b(x) \cdot e^{\int \limits_{1}^{x} \frac{ b(u) - 1 }{u} du}.

We see that h(x)h(x) is represented as a regularly varying function with a variation index of β1\beta - 1, because:

h(x)=01h(u)dub(x)xconste1xb(u)1xβ1udu=c(x)e1xB(u)uduh(x) = \underbrace{\int \limits_0^1 h(u) du \cdot b(x) }_{ \xrightarrow{x \to \infty } const } \cdot e^{\int \limits_{1}^{x} \frac{ \overbrace{b(u) - 1}^{ \xrightarrow{x \to \infty } \beta - 1} }{u} du} = c(x) e^{\int \limits_{1}^{x} \frac{B(u)}{u} du}, where B(u)uβ1B(u) \xrightarrow{u \to \infty} \beta - 1.

This is exactly Karamata's representation of a regularly varying function with index of variation \beta - 1; see the Karamata representation and characterization theorems below.

Lemma 4.2: Karamata’s representation theorem

Function LL is slowly varying if and only if it can be represented as L(x)=c(x)exp(1xϵ(t)tdt)L(x) = c(x) exp({\int \limits_1^x \frac{\epsilon(t)}{t}dt}).

Here c(x) and \epsilon(x) are measurable functions defined on \mathbb{R}^+, with c(x) taking positive values, and:

limxc(x)=c(0,)\lim \limits_{x \to \infty} c(x) = c \in (0, \infty) (approaches a constant cc)

limxϵ(x)=0\lim \limits_{x \to \infty} \epsilon(x) = 0

Proof:

Direct result

\frac{L(tx)}{L(t)} = \frac{c(tx) e^{\int \limits_1^{tx} \frac{\epsilon(u)}{u} du }}{c(t) e^{\int \limits_{1}^{t} \frac{\epsilon(u)}{u} du } } = \frac{c(tx)}{c(t)} e^{\int \limits_t^{tx} \frac{ \epsilon(u) }{u} du}.

Now take t large enough that |\epsilon(u)| \le \epsilon for all u \ge t, where \epsilon > 0 is arbitrarily small. Then:

1xϵeϵ(ln(tx)lnt)ettxϵ(u)udueϵ(ln(tx)lnt)=xϵ11 \leftarrow x^{-\epsilon} \le e^{-\epsilon (\ln(tx) -\ln t)} \le e^{\int \limits_t^{tx} \frac{ \epsilon(u) }{u} du} \le e^{\epsilon (\ln(tx) -\ln t)} = x^{\epsilon} \to 1

And with c(tx)c(t)1\frac{c(tx)}{c(t)} \to 1 the result is proven.

Conversely

By the ratio statement of Karamata's theorem (applied with \beta = 0), b(x) = \frac{x L(x)}{\int \limits_0^{x} L(s) ds} \to 1 as x \to \infty

L(x)=b(x)0xL(s)dsxL(x) = \frac{b(x) \int \limits_0^{x} L(s) ds}{x}

Denote ϵ(x)=b(x)10\epsilon(x) = b(x) - 1 \to 0 and consider:

1xϵ(t)tdt=1x(L(t)0tL(s)ds1t)dt=1xL(t)0tL(s)dsdt(lnxln1)=1xd(ln0tL(s)ds)lnx=ln0xL(s)ds01L(s)dslnx\int \limits_{1}^{x} \frac{ \epsilon(t) }{t} dt = \int \limits_{1}^{x} (\frac{L(t)}{\int \limits_0^{t} L(s) ds} - \frac{1}{t}) dt = \int \limits_{1}^{x} \frac{L(t)}{\int \limits_0^{t} L(s) ds} dt - (\ln x - \ln 1) = \int \limits_{1}^{x} d( \ln \int \limits_0^t L(s) ds ) - \ln x = \ln \frac{ \int \limits_0^x L(s) ds }{ \int \limits_0^1 L(s) ds } - \ln x

e1xϵ(t)tdt=0xL(s)dsx01L(s)ds=L(x)b(x)01L(s)dse^{\int \limits_{1}^{x} \frac{ \epsilon(t) }{t} dt} = \frac{ \int \limits_0^x L(s) ds }{ x \int \limits_0^1 L(s) ds } = \frac{L(x)}{b(x) \int \limits_0^1 L(s) ds}

L(x)=b(x)(01L(s)ds)c(x)e1xϵ(t)tdtL(x) = \underbrace{b(x) (\int \limits_0^{1} L(s) ds)}_{c(x)} \cdot e^{\int \limits_{1}^{x} \frac{ \epsilon(t) }{t} dt}

Which leads to c(x)=b(x)01L(s)dsc(x) = b(x) \int \limits_0^1 L(s) ds and L(x)=c(x)e1xϵ(t)tdtL(x) = c(x) \cdot e^{\int \limits_{1}^{x} \frac{ \epsilon(t) }{t} dt}.
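To make the representation less abstract, here is a minimal numeric sketch for the slowly varying L(x) = ln(1 + x), whose integral has a closed form; we recover b(x) and \epsilon(x) = b(x) - 1 exactly as in the converse proof above (the choice of L is just an illustration, numpy is assumed):

```python
import numpy as np

L = lambda x: np.log1p(x)                    # a slowly varying function
I = lambda x: (1.0 + x) * np.log1p(x) - x    # closed form of the integral of L from 0 to x

for x in [1e2, 1e4, 1e8, 1e12]:
    b = x * L(x) / I(x)      # b(x) from the proof above; should tend to 1
    eps = b - 1.0            # epsilon(x) in Karamata's representation; should tend to 0
    print(f"x={x:.0e}  b(x)={b:.5f}  epsilon(x)={eps:.5f}")
```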

Lemma 4.3: Karamata’s characterization theorem

Every regularly varying function h:(0,+)(0,+)h: (0, +\infty) \to (0, +\infty) can be expressed through some slow-varying function ll as: h(x)l(x)=xβ\frac{h(x)}{l(x)} = x^{\beta}, where β\beta is a real number.

Alternative formulation: function hh is regularly varying with index of variation β\beta if and only if it can be represented as h(x)=c(x)exp(1xb(t)tdt)h(x) = c(x) exp({\int \limits_1^x \frac{b(t)}{t}dt}), where b(t)tβb(t) \xrightarrow{t \to \infty} \beta and c(x)xcc(x) \xrightarrow{x \to \infty} c.

Proof:

Direct result

Heuristically, for large x we have b(t) \approx \beta, so h(x) = c(x) exp({\int \limits_1^x \frac{b(t)}{t}dt}) behaves like a constant multiple of x^{\beta} up to a slowly varying factor. More formally:

\frac{h(tx)}{h(t)} = \frac{ c(tx) exp({\int \limits_1^{tx} \frac{b(s)}{s}ds}) }{ c(t) exp({\int \limits_1^{t} \frac{b(s)}{s}ds}) } = \frac{c(tx)}{c(t)} exp({\int \limits_t^{tx} \frac{b(s)}{s}ds}) \xrightarrow{t \to \infty} 1 \cdot e^{\beta \ln x} = x^\beta (since b(s) \approx \beta for s \ge t when t is large), which makes h a regularly varying function by definition.

As for the representation of h(x) through a slowly varying l(x), we can take l(x) = c(x) exp(\int \limits_1^x \frac{b(t) - \beta}{t} dt), which is slowly varying by Karamata's representation theorem (as b(t) - \beta \to 0), so that h(x) = x^\beta l(x).

Conversely

Suppose that l(x)l(x) is a slowly varying function. By Karamata’s representation l(x)=b(x)01l(s)dse1xϵ(t)tdtl(x) = b(x) \int \limits_0^1 l(s) ds \cdot e^{ \int \limits_1^x \frac{\epsilon(t)}{t} dt } or just l(x)=c(x)e1xϵ(t)tdtl(x) = c(x) \cdot e^{ \int \limits_1^x \frac{\epsilon(t)}{t} dt }.

By theorem’s assumption h(x)=xβl(x)=eβ(lnx)c(x)e1xϵ(t)tdt=e1xβtdtc(x)e1xϵ(t)tdt=c(x)e1xϵ(t)+βtdth(x) = x^\beta l(x) = e^{\beta(\ln x)} c(x) \cdot e^{ \int \limits_1^x \frac{\epsilon(t)}{t} dt } = e^{ \int \limits_1^x \frac{\beta}{t} dt } c(x) \cdot e^{ \int \limits_1^x \frac{\epsilon(t)}{t} dt } = c(x) \cdot e^{ \int \limits_1^x \frac{\epsilon(t) + \beta}{t} dt }.

Thus \beta(t) = \epsilon(t) + \beta is a function that converges to \beta as t \to \infty, which gives the desired representation.

Lemma 4.4: Inverse of a regularly varying function

If h(x) is a non-decreasing regularly varying function with index of variation \rho, 0 \le \rho \le \infty, and h(\infty) = \infty, then its (generalized) inverse h^{\leftarrow} is a regularly varying function with index of variation \frac{1}{\rho}.

Proof:

h(tx)h(t)txρ\frac{h(tx)}{h(t)} \xrightarrow{t \to \infty} x^{\rho}

Take t=h(s)t = h^{\leftarrow}(s) (hence, for tt \to \infty we need to make sure that h(s)h^{\leftarrow}(s) \to \infty). Then:

h(h(s)x)h(h(s))txρ\frac{h(h^{\leftarrow}(s)x)}{h(h^{\leftarrow}(s))} \xrightarrow{t \to \infty} x^{\rho}

h(h(s)x)sxρh(h^{\leftarrow}(s)x) \to s \cdot x^\rho

Apply hh^{\leftarrow} to both sides:

h(s)xh(sxρ)h^{\leftarrow}(s) \cdot x \to h^{\leftarrow}(s \cdot x^\rho)

h(sxρ)h(s)x\frac{h^{\leftarrow}(s\cdot x^\rho)}{h^{\leftarrow}(s)} \to x

Denote y=xρy = x^\rho:

h(sy)h(s)y1/ρ\frac{h^{\leftarrow}(sy)}{h^{\leftarrow}(s)} \to y^{1/\rho}, which is the definition of regularly varying function with index of variation 1/ρ1/\rho upon ss \to \infty.
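A small numeric sketch of this lemma, assuming scipy is available: we invert a hand-picked non-decreasing regularly varying function numerically and check that the ratio of inverses approaches y^{1/\rho} (slowly, because of the logarithmic factor):

```python
import numpy as np
from scipy.optimize import brentq

rho = 2.0
h = lambda x: x**rho * np.log1p(x)   # non-decreasing, regularly varying with index rho

def h_inverse(s):
    """Generalized inverse of h, found numerically (h is strictly increasing here)."""
    return brentq(lambda x: h(x) - s, 1e-9, 1e9)

y = 4.0
for s in [1e4, 1e8, 1e12]:
    ratio = h_inverse(s * y) / h_inverse(s)
    print(f"s={s:.0e}  inverse ratio={ratio:.4f}  y^(1/rho)={y**(1 / rho):.4f}")
```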

Lemma 4.5: Regularly varying function with a positive index of variation is unbounded

If hh is a regularly varying function with index of variation β>0\beta > 0, limxh(x)=\lim \limits_{x \to \infty} h(x) = \infty.

Proof:

Apply Karamata’s characterization of h(x)h(x):

h(x)=c(x)e1xβ(t)tdth(x) = c(x) e^{\int \limits_1^x \frac{\beta(t)}{t} dt}, where c(x)cc(x) \to c and β(t)β\beta(t) \to \beta.

Consider the ratio \frac{\ln h(x)}{\ln x} = \frac{\ln c(x) + \int \limits_1^x \frac{\beta(t)}{t} dt}{\int \limits_1^x \frac{1}{t} dt} \xrightarrow{x \to \infty} \beta.

Hence, h(x)xh(x) \xrightarrow{x \to \infty} \infty.

Lemma 4.6: Uniform convergence of regularly varying function ratio

Given a regularly varying function hh with index of variation β\beta, there exists t0t_0 such that for arbitrarily small ϵ>0\epsilon > 0, x1x \ge 1 and tt0t \ge t_0:

(1ϵ)xβϵ<h(tx)h(t)<(1+ϵ)xβ+ϵ(1 - \epsilon) x^{\beta - \epsilon} < \frac{h(tx)}{h(t)} < (1 + \epsilon) x^{\beta + \epsilon} .

Proof:

From Karamata's representation: \frac{h(tx)}{h(t)} = \frac{c(tx)}{c(t)} exp(\int \limits_1^{tx} \frac{\beta(s)}{s} ds - \int \limits_1^{t} \frac{\beta(s)}{s} ds) = \frac{c(tx)}{c(t)} exp(\int \limits_t^{tx} \frac{\beta(s)}{s} ds) \xrightarrow{t \to \infty} e^{\beta \ln x} = x^{\beta}.

Pick t_0 large enough so that for all t \ge t_0 and all s \ge t_0:

(1 - \epsilon) < \frac{c(tx)}{c(t)} < (1 + \epsilon)

\beta - \epsilon < \beta(s) < \beta + \epsilon

Since x \ge 1, this gives (\beta - \epsilon) \ln x \le \int \limits_t^{tx} \frac{\beta(s)}{s} ds \le (\beta + \epsilon) \ln x, and therefore

\frac{h(tx)}{h(t)} = \frac{c(tx)}{c(t)} e^{\int \limits_t^{tx} \frac{\beta(s)}{s} ds} < (1 + \epsilon) e^{(\beta + \epsilon) \ln x} = (1 + \epsilon) x^{\beta + \epsilon}.

Similarly we derive the lower bound.

Necessary and sufficient conditions of convergence to Types II or III EVD

Theorem 4.1: necessary and sufficient conditions for a distribution to belong to a type II (Frechet) EVD

A distribution belongs to Extreme Value Distribution type II (Frechet) domain of attraction if and only if xF=x_F = \infty and limtS(tx)S(t)=xα\lim \limits_{t \to \infty} \frac{S(tx)}{S(t)} = x^{-\alpha}, where α>0\alpha > 0 and x>0x > 0.

Proof of direct statement:

Choose a_n = t \to \infty such that S(t) = \frac{1}{n} (and b_n = 0). The proof then follows almost immediately from the definition of a regularly varying function:

FMn(tx)=p(Mntx)=Fξ(tx)n=p(ξitx)n=(1p(ξitx))n=(1S(tx))n(1xαS(t))n=(1xα1n)n=exαF_{M_n}(tx) = p(M_n \le tx) = F_\xi(tx)^n = p(\xi_i \le tx)^n = (1 - p(\xi_i \ge tx))^n = (1 - S(tx))^n \to (1 - x^{-\alpha} S(t))^n = (1 - x^{-\alpha} \cdot \frac{1}{n})^n = e^{-{x^{-\alpha}}}
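Here is a minimal simulation sketch of the direct statement for the classical Pareto distribution with S(x) = x^{-\alpha}; all parameters below are illustrative choices and numpy is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, n_sim = 2.0, 1_000, 10_000

# rng.pareto draws Lomax (Pareto II) samples; adding 1 gives the classical Pareto with S(x) = x**(-alpha)
samples = rng.pareto(alpha, size=(n_sim, n)) + 1.0
maxima = samples.max(axis=1) / n**(1.0 / alpha)    # normalization: a_n = n^(1/alpha), b_n = 0

x = 2.0
empirical = (maxima <= x).mean()
frechet = np.exp(-x**(-alpha))
print(f"P(M_n / a_n <= {x}) = {empirical:.4f}   Frechet limit: {frechet:.4f}")
```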

Proof of converse statement:

This proof of converse statement is incredibly clumsy and technical. I really wish I knew a better proof than this one, taken from Resnick textbook, with extra commentaries and clarifications added by me.

Step 1: re-formulate the problem in terms of tail quantile function γ\gamma

Fn(anx+bn)nexαF^n(a_n x + b_n) \xrightarrow{n \to \infty} e^{-x^{-\alpha}}.

(1 - \frac{n S(a_n x + b_n)}{n})^n \approx e^{-n S(a_n x + b_n)} \to e^{-x^{-\alpha}}

nS(anx+bn)xαnS(a_n x + b_n) \to x^{-\alpha}

1S(anx+bn)nxα\frac{1}{S(a_n x + b_n)} \to n x^\alpha

anx+bnγ(nxα)a_n x + b_n \to \gamma(n x^\alpha)

Transition from discrete nn to continuous tt.

a(t)x+b(t)γ(txα)a(t)x + b(t) \to \gamma(tx^\alpha)

Replace xx with y=xαy = x^\alpha:

γ(ty)b(t)a(t)y1/α\frac{ \gamma(ty) - b(t) }{a(t)} \to y^{1/\alpha}

Substitute y=1y=1 into this, yielding γ(t)b(t)a(t)1\frac{\gamma(t) - b(t)}{a(t)} \to 1.

Subtract this from the previous expression, resulting in:

γ(ty)b(t)a(t)γ(t)b(t)a(t)=γ(ty)γ(t)a(t)y1/α1\frac{ \gamma(ty) - b(t) }{a(t)} - \frac{\gamma(t) - b(t)}{a(t)} = \frac{\gamma(ty) - \gamma(t)}{a(t)} \to y^{1/\alpha} - 1

Step 2: a(t)a(t) is a regularly varying function

Substitute tx for t and \frac{1}{x} for y in \frac{ \gamma(ty) - \gamma(t) }{a(t)}, yielding: \frac{ \gamma(tx \cdot 1/x) - \gamma(tx) }{a(tx)} \xrightarrow{t \to \infty} x^{-1/\alpha} - 1.

Now we can use this construct to show that a(t)a(t) is regularly varying:

Consider the ratio limta(tx)a(t)=limt(γ(tx)γ(t)a(t))/((γ(tx1/x)γ(tx)a(tx)))=\lim \limits_{t \to \infty} \frac{a(tx)}{a(t)} = \lim \limits_{t \to \infty} (\frac{ \gamma(tx) - \gamma(t) }{ a(t) }) / (-( \frac{ \gamma(tx \cdot 1/x) - \gamma(tx) }{ a(tx) })) =

=(x1/α1)/((x1/α1))=x1/α = (x^{1/\alpha} - 1) / (-(x^{-1/\alpha} - 1)) = x^{1/\alpha}.

Step 3: convergence of difference of integrals

Now we’re going to introduce another intermediate construct that we’ll use in this next step of this proof.

We aim to prove the following fact about a difference of integrals: limt2tyγ(s)dstya(t)2tγ(s)dsta(t)=αα+1(y1/α1)\lim \limits_{t \to \infty} \frac{\int \limits_{2}^{ty} \gamma(s) ds }{ty a(t)} - \frac{\int \limits_2^t \gamma(s)ds}{t a(t)} = \frac{\alpha}{\alpha + 1} (y^{1/\alpha} - 1).

First, notice that γ(tx)γ(t)\gamma(tx) - \gamma(t) as a function of tt is regularly varying with index of variation 1/α1/\alpha.

Indeed, a(t) is regularly varying with index 1/\alpha and we consider the ratio \frac{\gamma(txy) - \gamma(ty)}{a(ty)} / \frac{\gamma(tx) - \gamma(t)}{a(t)}, where both numerator and denominator tend to x^{1/\alpha} - 1 as t \to \infty (because ty \to \infty as well for any fixed y). As \frac{a(ty)}{a(t)} \xrightarrow{t \to \infty} y^{1/\alpha}, we get \frac{\gamma(txy) - \gamma(ty)}{\gamma(tx) - \gamma(t)} \to y^{1/\alpha}.

Apply ratio part of Karamata’s to the following: limtt(γ(tx)γ(t))2t(γ(sx)γ(s))ds=1/α+1=α+1α\lim \limits_{t \to \infty} \frac{t (\gamma(tx) - \gamma(t))}{\int \limits_{2}^{t} (\gamma(sx) - \gamma(s)) ds} = 1/\alpha + 1 = \frac{\alpha + 1}{\alpha}.

Invert this ratio to get: limt2t(γ(sx)γ(s))dst(γ(tx)γ(t))=αα+1\lim \limits_{t \to \infty} \frac {\int \limits_{2}^{t} (\gamma(sx) - \gamma(s)) ds} {t (\gamma(tx) - \gamma(t))} = \frac{\alpha}{\alpha + 1}.

In the numerator let’s take a closer look at the first term s=2tγ(sx)ds=1xsx=2xtxγ(sx)d(sx)\int \limits_{s=2}^{t} \gamma(sx) ds = \frac{1}{x} \int \limits_{sx=2x}^{tx} \gamma(sx) d(sx). Denoting u=sxu = sx, we get 1xu=2xtxγ(u)du\frac{1}{x} \int \limits_{u=2x}^{tx} \gamma(u) du.

We can extend this integral's lower limit from 2x to 2, because \frac{ \int \limits_2^{2x} \gamma(u) du }{ t (\gamma(tx) - \gamma(t)) } \xrightarrow{t \to \infty} 0: the numerator does not depend on t, while the denominator t (\gamma(tx) - \gamma(t)) \xrightarrow{t \to \infty} \infty, since \gamma(tx) - \gamma(t) is a regularly varying function of t with a positive index of variation and hence tends to infinity by lemma 4.5.

Thus, our ratio of interest now is limt1y2tyγ(s)ds2tγ(s)dst(γ(ty)γ(t))=limt1y2tyγ(s)ds2tγ(s)dsta(t)(γ(ty)γ(t))a(t)\lim \limits_{t \to \infty} \frac{ \frac{1}{y} \int \limits_{2}^{ty} \gamma(s) ds - \int \limits_{2}^{t} \gamma(s) ds }{ t (\gamma(ty) - \gamma(t)) } = \lim \limits_{t \to \infty} \frac{ \frac{1}{y} \int \limits_{2}^{ty} \gamma(s) ds - \int \limits_{2}^{t} \gamma(s) ds }{ t a(t) \frac{(\gamma(ty) - \gamma(t))}{a(t)} }.

Recalling that (γ(ty)γ(t))a(t)(y1/α1)\frac{(\gamma(ty) - \gamma(t))}{a(t)} \to (y^{1/\alpha} - 1), we get limt2tyγ(s)dstya(t)2tγ(s)dsta(t)=αα+1(y1/α1)\lim \limits_{t \to \infty} \frac{\int \limits_{2}^{ty} \gamma(s) ds }{ty a(t)} - \frac{\int \limits_2^t \gamma(s)ds}{t a(t)} = \frac{\alpha}{\alpha + 1} (y^{1/\alpha} - 1).

Step 4: introduce intermediate function α(t)\alpha(t), show that it is regularly varying

Now thanks to the construct we introduced in the previous step of the proof, we are ready to introduce an intermediate function α(t)\alpha(t), show that it is regularly varying and express tail quantile function γ\gamma through it.

Consider \int \limits_{\delta}^{1} \frac{ \gamma(ty) - \gamma(t) }{ a(t) } dy \xrightarrow{t \to \infty} \int \limits_{\delta}^{1} (y^{1/\alpha} - 1) dy = \int \limits_{\delta}^{1} y^{1/\alpha} dy - (1 - \delta).

(1δ)γ(t)δttγ(s)dsta(t)11+α+O(δ)\frac{ (1 - \delta) \gamma(t) - \frac{ \int \limits_{\delta t}^t \gamma(s)ds }{t} } { a(t) } \to \frac{1}{1 + \alpha} + O(\delta), where O(δ)=δ+αα+1δα+1αO(\delta) = -\delta + \frac{ \alpha }{ \alpha + 1 } \delta^{\frac{\alpha + 1}{ \alpha }}

Rewrite the left-hand side of the previous limit as:

\frac{ 1 } { a(t) } \cdot (\gamma(t) - \frac{\int \limits_2^t \gamma(s)ds}{t} - \delta (\gamma(t) - \frac{1}{\delta t} \int \limits_2^{\delta t} \gamma(s) ds) )

Add and subtract 1a(t)δ2tγ(s)dst\frac{1}{a(t)} \delta \frac{\int \limits_2^t \gamma(s) ds}{t}:

1a(t)(γ(t)2tγ(s)dstδ(γ(t)2tγ(s)dst+2tγ(s)dst2δtγ(s)dsδt))=\frac{ 1 } { a(t) } \cdot ( \gamma(t) - \frac{ \int \limits_2^t \gamma(s) ds }{ t } - \delta (\gamma(t) - \frac{ \int \limits_2^t \gamma(s) ds }{t} + \frac{ \int \limits_2^t \gamma(s) ds }{t} - \frac{ \int \limits_2^{\delta t} \gamma(s) ds }{\delta t} ) ) =

=1a(t)((1δ)(γ(t)2tγ(s)dst)δ(1t2tγ(s)ds1δt2δtγ(s)ds))1α+1+O(δ) = \frac{ 1 } { a(t) } \cdot ( (1-\delta) (\gamma(t) - \frac{ \int \limits_2^t \gamma(s) ds }{t}) - \delta (\frac{1}{t} \int \limits_2^t \gamma(s)ds - \frac{1}{\delta t} \int \limits_2^{\delta t} \gamma(s) ds ) ) \to \frac{1}{\alpha + 1} + O(\delta).

Consider the second term: 1a(t)(1t2tγ(s)ds1δt2δtγ(s)ds)\frac{1}{a(t)} (\frac{1}{t} \int \limits_2^t \gamma(s)ds - \frac{1}{\delta t} \int \limits_2^{\delta t} \gamma(s) ds ).

In the previous step of the proof we've shown that \frac{1}{a(t)} (\frac{1}{\delta t} \int \limits_2^{\delta t} \gamma(s)ds - \frac{1}{t} \int \limits_2^{t} \gamma(s)ds ) \to \frac{\alpha}{1 + \alpha} (\delta^{1 / \alpha} - 1), so the term above converges to \frac{\alpha}{1 + \alpha} (1 - \delta^{1/\alpha}).

Hence, \frac{1}{a(t)} (1-\delta) (\gamma(t) - \frac{ \int \limits_2^t \gamma(s)ds }{t} ) - \delta \frac{\alpha}{1 + \alpha} (1 - \delta^{1 / \alpha}) \to \frac{1}{1 + \alpha} + O(\delta)

And 1a(t)(γ(t)2tγ(s)dst)t(1/(1+α))+O(δ)+δα(1+α)(1δ1/α)(1δ)=(1/(1+α))δ+αδα+1α1α+1+δα(1/(α+1))(1δ1/α)1δ=11+α\frac{1}{a(t)} (\gamma(t) - \frac{ \int \limits_2^t \gamma(s)ds }{t} ) \xrightarrow{t \to \infty} \frac{ (1 / (1 + \alpha)) + O(\delta) + \delta \frac{\alpha}{ (1 + \alpha) } (1 - \delta^{1/\alpha}) }{ (1-\delta) } = \frac{ (1 / (1 + \alpha)) - \delta + \alpha \delta^{\frac{\alpha + 1}{\alpha} } \frac{1}{\alpha + 1} + \delta \alpha (1 / (\alpha + 1)) (1 - \delta^{1/\alpha}) }{1 - \delta} = \frac{1}{1 + \alpha}

Finally, introduce an intermediate function α(t)=γ(t)2tγ(s)dst\alpha(t) = \gamma(t) - \frac{ \int \limits_2^t \gamma(s)ds }{t}, and from the previous expression we now know that α(t)a(t)1+α\alpha(t) \sim \frac{a(t)}{1 + \alpha}, thus, being a regularly varying function with index of variation 1α\frac{1}{\alpha}.

Step 5: use α(t)\alpha(t) to show that γ(t)a(t)\gamma(t) \sim a(t)

α(t)=γ(t)2tγ(s)dst\alpha(t) = \gamma(t) - \frac{ \int \limits_{2}^{t} \gamma(s)ds }{ t }

Divide each side of this expression by tt and take integral from 22 to yy:

\int \limits_2^y \frac{\alpha(t)}{t} dt = \int \limits_2^y \frac{\gamma(t)}{t} dt - \int \limits_2^y \frac{1}{t^2} \int \limits_2^t \gamma(s) ds \, dt

Investigate the second term \int \limits_2^y \frac{1}{t^2} \int \limits_2^t \gamma(s) ds \, dt. Invert the order of integration:

\int \limits_2^y (\int \limits_s^y \frac{1}{t^2}dt) \gamma(s) ds = \int \limits_2^y (\frac{1}{s} - \frac{1}{y}) \gamma(s) ds = \int \limits_2^y \frac{\gamma(s)}{s} ds - \frac{1}{y} \int \limits_2^y \gamma(s) ds = \int \limits_2^y \frac{\gamma(s)}{s} ds + \alpha(y) - \gamma(y)

Substituting this back and using the definition of \alpha(t), we get:

γ(y)=α(y)+2yα(t)tdt\gamma(y) = \alpha(y) + \int_2^y \frac{\alpha(t)}{t}dt

Hence, γ(y)α(y)=1+2yα(t)tdtα(y)\frac{\gamma(y)}{\alpha(y)} = 1 + \frac{\int_2^y \frac{\alpha(t)}{t}dt }{\alpha(y)}, and in the right-hand side we get the inverse of ratio statement of Karamata’s theorem. Applying it, we get:

1+01t1/α1dt=1+11/α=1+α1 + \int \limits_0^1 t^{1/\alpha -1} dt = 1 + \frac{1}{1/\alpha} = 1 + \alpha.

Thus, \gamma(t) \sim (1 + \alpha) \alpha(t) \sim \frac{1 + \alpha}{1 + \alpha} a(t) = a(t) as t \to \infty, so \gamma is regularly varying with index of variation \frac{1}{\alpha}.

Step 6: S(x)S(x) is regularly varying

\gamma = (\frac{1}{S})^{\leftarrow}. As \gamma is regularly varying with index of variation 1/\alpha, its inverse \frac{1}{S} is regularly varying with index of variation \alpha by lemma 4.4.

If \frac{1}{S} is regularly varying with index of variation \alpha, then its reciprocal S is regularly varying with index of variation -\alpha.

So, S(tx)S(t)txα\frac{S(tx)}{S(t)} \xrightarrow{t \to \infty} x^{-\alpha}, which is the result we aimed to prove.

Theorem 4.2. Necessary and sufficient conditions of convergence to Type III (Inverse Weibull) EVD

A distribution of maximum of a random variable ξ\xi converges to Type III (Inverse Weibull) EVD if and only if xF<x_F < \infty and limt0Sξ(xFtx)Sξ(xFt)=xβ\lim \limits_{t \to 0} \frac{S_\xi(x_F - tx)}{S_\xi(x_F - t)} = x^{\beta}, where β>0\beta > 0, x>0x > 0, SξS_\xi is survival function.

Equivalent formulation (up to a change of notation): A distribution of maximum of a random variable \xi converges to Type III (Inverse Weibull) EVD if and only if \frac{S(x_F - \frac{1}{tx})}{S(x_F - \frac{1}{t})} \xrightarrow{t \to \infty} x^{-\beta} (i.e. a regularly varying function with index of variation -\beta).

Proof of direct statement:

In the proof of direct statement I’ll be using the first form of notation.

Given S(xFtx)S(xFt)xβ\frac{ S(x_F - t x) }{ S(x_F - t) } \to x^{\beta}, where β>0\beta > 0 and xF<x_F < \infty, we need to show that Fn(anx+bn)e(x)βF^n(a_n x + b_n) \to e^{-(-x)^{\beta}}.

The proof of direct statement is similar in spirit to the proof of von Mises sufficient conditions.

FMn(xFtx)=p(MnxFtx)=p(xFMntx)=pn(xFξtx)=(1p(xFξtx))n=(1S(xFtx))n(1S(xFt)xβ)nF_{M_n}(x_F - tx) = p(M_n \le x_F - tx) = p(x_F - M_n \ge tx) = p^n(x_F - \xi \ge tx) = (1 - p(x_F - \xi \le tx) )^n = (1 - S(x_F - tx))^n \to (1 - S(x_F - t) \cdot x^{\beta})^n

Now choose t=γ(n)t = \gamma(n) (a value of tail quantile function, such that S(xFt)=1nS(x_F - t) = \frac{1}{n}), and we get:

(1 - S(x_F - t) \cdot x^{\beta})^n = (1 - \frac{1}{n} \cdot x^{\beta})^n \to e^{-x^{\beta}}.

Hence, with b_n = x_F, a_n = t = \gamma(n) and the re-assignment x := -x (so that now x < 0), F_{M_n}(a_n x + b_n) = F_{M_n}(x_F + t x) \to e^{-(-x)^{\beta}}.
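A minimal simulation sketch of the direct statement for the uniform distribution on [0, 1], for which x_F = 1 and \beta = 1 (parameters are illustrative; numpy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sim = 1_000, 10_000

maxima = rng.random((n_sim, n)).max(axis=1)   # U(0, 1): x_F = 1, beta = 1
z = n * (maxima - 1.0)                        # normalization: a_n = 1/n, b_n = x_F = 1

x = -0.5                                      # the Type III limit lives on x < 0
empirical = (z <= x).mean()
limit = np.exp(-(-x)**1.0)                    # e^{-(-x)^beta} with beta = 1
print(f"P(n(M_n - 1) <= {x}) = {empirical:.4f}   Type III limit: {limit:.4f}")
```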

Proof of converse statement:

Given F^n(a_n x + b_n) \to e^{-(-x)^{\beta}}, we need to show that \frac{S(x_F - tx)}{S(x_F - t)} \xrightarrow{t \to 0} x^{\beta}.

Again, this proof is extremely ugly and technical. Again, I copy-pasted it from the Resnick textbook, altering/simplifying/unifying the notation, and added my comments and clarifications everywhere.

Step 1: re-formulate the problem in terms of tail quantile function γ\gamma

F^n(a_n x + b_n) \to e^{-(-x)^{\beta}}

(1S(anx+bn))n=(1nS(anx+bn)n)nenS(anx+bn)e(x)β(1 - S(a_n x + b_n))^n = (1 - \frac{nS(a_n x + b_n)}{n})^n \approx e^{-nS(a_n x + b_n)} \to e^{-(-x)^\beta}

nS(anx+bn)(x)βnS(a_n x + b_n) \to (-x)^\beta

1nS(anx+bn)(x)β\frac{1}{n S(a_n x + b_n)} \to (-x)^{-\beta}

1S(anx+bn)n(x)β\frac{1}{S(a_n x + b_n)} \to n (-x)^{-\beta}

Recall that the tail quantile function is the inverse of the reciprocal of the survival function: \gamma = (\frac{1}{S})^{\leftarrow}.

Set y=(x)βy = (-x)^{-\beta}, so that x=y1/βx = - y^{-1/\beta}, and apply γ\gamma to the previous expression, yielding:

an(y1/β)+bnγ(ny)a_n (-y^{-1/\beta}) + b_n \to \gamma(n y)

(γ(ny)bn)/an(y1/β)(\gamma(ny) - b_n) / a_n \to -(y^{-1/\beta}) and replace discrete nn with continuous variable tt, replacing bnb_n with b(t)b(t), ana_n with a(t)a(t) and using the fact that upon y=1y=1, we have γ(t1)b(t)a(t)\gamma(t \cdot 1) \to b(t) - a(t):

γ(ty)γ(t)a(t)=γ(ty)b(t)a(t)γ(t)b(t)a(t)1y1/β\frac{\gamma(ty) - \gamma(t)}{a(t)} = \frac{\gamma(ty) - b(t)}{a(t)} - \frac{\gamma(t) - b(t)}{a(t)} \to 1 - y^{-1/\beta}

Step 2: show that a(t)a(t) is a regularly varying function

Let’s show that a(t)a(t) is a regularly varying function by considering the ratio limta(tx)a(t)\lim \limits_{t \to \infty} \frac{a(tx)}{a(t)}.

To construct it from the result of previous step, consider y=txy = tx and γ(y1x)γ(y)a(y)\frac{\gamma(y \frac{1}{x}) - \gamma(y)}{a(y)} and then consider the ratio:

limta(tx)a(t)=γ(tx)γ(t)a(t)/γ(tx1x)γ(tx)a(tx)=(1x1/β)/(1x1/β)=x1/β\lim \limits_{t \to \infty} \frac{a(tx)}{a(t)} = \frac{\gamma(tx) - \gamma(t)}{a(t)} / \frac{\gamma(tx \cdot \frac{1}{x}) - \gamma(tx)}{a(tx)} = (1 - x^{-1/\beta}) / -(1 - x^{1/\beta}) = x^{-1/\beta}.

Thus, a(t)a(t) is a regularly varying function with index of variation 1β-\frac{1}{\beta} by definition.

Step 3: show that xF<x_F < \infty i.e. γ()<\gamma(\infty) < \infty

When does the survival function have a finite upper end point? It happens if \gamma(\infty) < \infty. This is synonymous to saying that \gamma(\infty) - \gamma(t_0) < \infty (as \gamma(t_0) is finite when t_0 is finite, and we can subtract a finite quantity from each side of the inequality). We can also divide this inequality by a finite positive a(t_0), and we get \frac{\gamma(\infty) - \gamma(t_0)}{a(t_0)} < \infty.

We can transform the previous expression to \frac{\gamma(k t_0) - \gamma(t_0)}{a(t_0)} < \infty as k \to \infty, i.e. we need to show that this quantity stays bounded even when k \to \infty.

We obtained a form, similar to γ(yt)γ(t)a(t)t1y1/β\frac{\gamma(yt) - \gamma(t)}{a(t)} \xrightarrow{t \to \infty} 1 - y^{-1/\beta} from Step 1 of this proof. But this time we need to swap the roles of variables: we make t=t0t = t_0 arbitrarily large, but not infinite, and make y=ky = k \to \infty.

Let us assign t=2nt = 2^n and y=2y = 2:

γ(22n)γ(2n)a(2n)2n121/β\frac{\gamma(2 \cdot 2^n) - \gamma(2^n)}{a(2^n)} \xrightarrow{2^n \to \infty} 1 - 2^{-1/\beta}, which means that

| \frac{\gamma(2 \cdot 2^n) - \gamma(2^n)}{a(2^n)} - (1 - 2^{-1/\beta}) | \le \epsilon for arbitrarily small \epsilon upon large enough 2^n.

If we set \epsilon = 1 - 2^{-1/\beta}, we get an estimate: \frac{\gamma(2 \cdot 2^n) - \gamma(2^n)}{a(2^n)} \le 2 (1 - 2^{-1/\beta})

Apply lemma 4.6 to regularly varying function a(t)a(t) to get an upper bound, a(2n+1)a(2n)21/β+δ\frac{a(2^{n+1})}{a(2^n)} \le 2^{-1/\beta + \delta}.

Now we can come up with an upper bound for γ(2n2k)γ(2n)a(2n)\frac{\gamma(2^n \cdot 2^k) - \gamma(2^n)}{a(2^n)}:

\frac{\gamma(2^n \cdot 2^k) - \gamma(2^n)}{a(2^n)} = \sum \limits_{j=1}^k \frac{ \gamma(2^{n + j}) - \gamma(2^{n + j - 1}) }{a(2^{n + j - 1})} \prod \limits_{i=n}^{n + j - 2} \frac{a(2^{i+1})}{a(2^i)} \le 2 (1 - 2^{-1/\beta}) \sum \limits_{j=1}^k (2^{-(1/\beta - \delta)})^{j-1},

which proves that limkγ(2n2k)γ(2n)a(2n)<\lim \limits_{k \to \infty} \frac{\gamma(2^n \cdot 2^k) - \gamma(2^n)}{a(2^n)} < \infty.

Step 4: γ()γ(t)a(t)\gamma(\infty) - \gamma(t) \sim a(t) is a regularly varying function

Now that we know that γ()<\gamma(\infty) < \infty, we can introduce γˉ(t)=γ()γ(t)\bar{\gamma}(t) = \gamma(\infty) - \gamma(t). Its analysis will be a stepping stone.

Use the fact that \frac{\bar{\gamma}(ty) - \bar{\gamma}(t)}{a(t)} = -\frac{\gamma(ty) - \gamma(t)}{a(t)} \to -(1 - y^{-1/\beta}) = y^{-1/\beta} - 1.

Divide this expression by y2y^2 and integrate: 11y2γˉ(ty)γˉ(t)a(t)dy1(y1/β21y2)dy\int \limits_1^\infty \frac{1}{y^2} \frac{\bar{\gamma}(ty) - \bar{\gamma}(t)}{a(t)} dy \to \int \limits_1^\infty (y^{-1/\beta-2} - \frac{1}{y^2})dy

Consider the first term: \int \limits_1^\infty y^{-1/\beta-2} dy = [ \frac{ y^{-1/\beta - 1} }{ -1/\beta - 1 } ]_1^{\infty} = \frac{1}{1/\beta + 1} = \frac{\beta}{1 + \beta}

Consider the second term: 1y2dy=1d1y=(111)=1\int \limits_1^\infty y^{-2} dy = - \int \limits_1^\infty d \frac{1}{y} = -(\frac{1}{\infty} - \frac{1}{1}) = 1

Hence, 11y2γˉ(ty)γˉ(t)a(t)dyβ1+β1=11+β\int \limits_1^\infty \frac{1}{y^2} \frac{\bar{\gamma}(ty) - \bar{\gamma}(t)}{a(t)} dy \to \frac{\beta}{1 + \beta} - 1 = -\frac{1}{1 + \beta}

On the other hand, we can apply the following direct transformations:

\int \limits_1^\infty \frac{1}{y^2} \frac{\bar{\gamma}(ty) - \bar{\gamma}(t)}{a(t)} dy = \frac{1}{a(t)} (\int \limits_1^\infty \frac{\bar{\gamma}(ty)}{y^2}dy - \bar{\gamma}(t) \int \limits_1^\infty \frac{1}{y^2} dy) = \frac{1}{a(t)} (t \int \limits_{t}^{\infty} \frac{\bar{\gamma}(s)}{s^2} ds - \bar{\gamma}(t)) = \frac{-1}{a(t)} (\bar{\gamma}(t) - t \int \limits_{t}^{\infty} \frac{\bar{\gamma}(s)}{s^2} ds )

Note that we changed the integration limits above from [1,][1, \infty] to [t,][t, \infty] by changing the variable from yy to s=tys = ty. Finally, combine these two expressions:

\frac{-1}{a(t)} (\bar{\gamma}(t) - t \int \limits_{t}^{\infty} \frac{\bar{\gamma}(s)}{s^2} ds ) \to \frac{-1}{1 + \beta} or

γˉ(t)ttγˉ(s)dss2a(t)1+β\bar{\gamma}(t) - t \int \limits_t^\infty \frac{\bar{\gamma}(s)ds}{s^2} \to \frac{a(t)}{1 + \beta}.

Step 5: introduce intermediate function α(t)\alpha(t) and analyze it

Let us internalize the result of previous step. Denote the function we just achieved as an intermediate function α(t)=γˉ(t)ttγˉ(s)dss2\alpha(t) = \bar{\gamma}(t) - t \int \limits_t^\infty \frac{\bar{\gamma}(s)ds}{s^2}.

We’ve just shown that α(t)a(t)1+β\alpha(t) \sim \frac{a(t)}{1 + \beta} as tt \to \infty, meaning that it is regularly varying with index of variation 1/β-1/\beta as we’ve shown in step 2.

Dividing α(t)\alpha(t) by tt and integrating, we get:

t=yα(t)tdt=yγˉ(t)tdtt=ys=tγˉ(s)s2dsdt\int \limits_{t=y}^\infty \frac{\alpha(t)}{t}dt = \int \limits_y^\infty \frac{\bar{\gamma}(t)}{t} dt - \int \limits_{t=y}^\infty \int \limits_{s=t}^\infty \frac{\bar{\gamma}(s)}{s^2}ds dt

Consider the second term: t=ys=tγˉ(s)s2dsdt\int \limits_{t=y}^\infty \int \limits_{s=t}^\infty \frac{\bar{\gamma}(s)}{s^2}ds dt.

Notice that we can change the order of integration and limits of integrals by Fubini’s theorem:

t=ys=tγˉ(s)s2dsdt=s=yt=ysγˉ(s)s2dtds\int \limits_{t=y}^\infty \int \limits_{s=t}^\infty \frac{\bar{\gamma}(s)}{s^2}ds dt = \int \limits_{s=y}^\infty \int \limits_{t=y}^s \frac{\bar{\gamma}(s)}{s^2} dt ds

(Figure: changing the order of integration and the integration limits by Fubini's theorem.)

Hence, s=yt=ysγˉ(s)s2dtds=s=y(tγˉ(s)s2t=yt=s)ds=s=ysγˉ(s)s2dsys=yγˉ(s)s2ds=s=yγˉ(s)sdsys=yγˉ(s)s2ds\int \limits_{s=y}^\infty \int \limits_{t=y}^s \frac{\bar{\gamma}(s)}{s^2} dt ds = \int \limits_{s=y}^\infty (t \frac{\bar{\gamma}(s)}{s^2}|_{t=y}^{t=s}) ds = \int \limits_{s=y}^\infty \frac{s \bar{\gamma}(s)}{s^2}ds - y \int \limits_{s=y}^\infty \frac{\bar{\gamma}(s)}{s^2}ds = \int \limits_{s=y}^\infty \frac{ \bar{\gamma}(s) }{s}ds - y \int \limits_{s=y}^\infty \frac{\bar{\gamma}(s)}{s^2}ds.

Substituting this representation of the second term back, we get: t=yα(t)tdt=yγˉ(t)tdts=yγˉ(s)sds+ys=yγˉ(s)s2ds\int \limits_{t=y}^\infty \frac{\alpha(t)}{t} dt = \cancel{\int \limits_y^\infty \frac{\bar{\gamma}(t)}{t} dt} - \cancel{\int \limits_{s=y}^\infty \frac{ \bar{\gamma}(s) }{s}ds} + y \int \limits_{s=y}^\infty \frac{\bar{\gamma}(s)}{s^2}ds.

Applying the definition of α(t)\alpha(t), we get: γˉ(y)α(y)=ys=yγˉ(s)s2ds=s=yα(s)sds\bar{\gamma}(y) - \alpha(y) = y \int \limits_{s=y}^\infty \frac{\bar{\gamma}(s)}{s^2}ds = \int \limits_{s=y}^\infty \frac{\alpha(s)}{s}ds.

γˉ(y)α(y)=1+1α(ys)α(y)sdsy1+1s1/βsds=1+11/β1ds1/β=1+11/β(11/β111/β)=1+β\frac{\bar{\gamma}(y)}{\alpha(y)} = 1 + \int \limits_1^\infty \frac{\alpha(ys)}{\alpha(y)s} ds \xrightarrow{y \to \infty} 1 + \int \limits_1^{\infty} \frac{s^{-1/\beta}}{s} ds = 1 + \frac{1}{-1/\beta} \int \limits_1^{\infty} d s^{-1/\beta} = 1 + \frac{1}{-1/\beta} (\frac{1}{\infty^{1/\beta}} - \frac{1}{1^{1/\beta}}) = 1 + \beta.

Hence, γˉ(y)(1+β)α(y)(1+β)1+βa(y)=a(y)\bar{\gamma}(y) \sim (1 + \beta)\alpha(y) \sim \frac{ (1 + \beta) }{ 1 + \beta } a(y) = a(y) as yy \to \infty.

As we’ve shown above a(y)a(y) is a regularly varying function, thus, γˉ(y)=γ()γ(y)\bar{\gamma}(y) = \gamma(\infty) - \gamma(y) is, too.

Step 6: S(xFt)S(x_F - t) is a regularly varying function

Function \bar{\gamma}(y) is the inverse of \frac{1}{S(x_F - x)} (viewed as a function of x), since \frac{1}{S(x_F - \bar{\gamma}(y))} = \frac{1}{S(\gamma(y))} = y.

Applying lemma 4.4 on the inverse of a regularly varying function, we see that \frac{1}{S(x_F - x)} is regularly varying as well. Then S(x_F - x) itself is regularly varying, too.

Necessary and sufficient conditions of convergence to Type I (Gumbel) EVD

Finally, we’ve come to the most complex result in this series. However, we’ll need an extra set of definitions and lemmas. Regular variation theory is enough for Type II and Type III EVD. However, for Type I we’ll need to extend this concept even further.

We’ll introduce extended concepts of variation, Г-variation and П-variation. Then we’ll show the two-way connection between the survival function, its tail quantile function and Type I EVD.

Definition 4.3: Г-variation

A non-decreasing function SS is said to be of Г-variation if S(t+xf(t))S(t)txFex\frac{S(t + xf(t))}{S(t)} \xrightarrow{t \to x_F} e^{x}, where ff is a positive auxiliary function.

Interestingly, all the auxiliary functions are asymptotically equivalent because of Khinchin’s lemma.

Consider the reciprocal ratio \frac{S(t)}{S(t + xf(t))} (because we need a distribution function to lie in the [0, 1] range) and convert it into a cumulative distribution function F_\xi(x) = 1 - \frac{S(t)}{S(t + xf(t))} \to 1 - e^{-x}, which converges to an exponential distribution.

Now if we had two different auxiliary functions f1(t)f_1(t) and f2(t)f_2(t), we can use them as sequences of coefficients in a pair of distributions: Fξ(f1(t)x)F_\xi(f_1(t)x) and Fξ(f2(t)x)F_\xi(f_2(t)x), which both converge to the same 1ex1 - e^{-x}, hence f1(t)f2(t)f_1(t) \sim f_2(t) as tt \to \infty.

The property of Г-variation of the survival function will be instrumental in proving the necessary condition of convergence of maximum of sets of i.i.d random variables of some distributions to Type I Gumbel EVD.
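As a small numeric illustration (not part of any proof), we can check Г-variation of U = 1/S for the standard normal distribution, taking the inverse hazard rate as the auxiliary function; scipy is assumed and the chosen values of t are arbitrary:

```python
import numpy as np
from scipy.stats import norm

# U = 1/S for the standard normal, auxiliary function f(t) = S(t)/phi(t) (inverse hazard rate)
x = 1.0
for t in [3.0, 6.0, 9.0]:
    f_t = norm.sf(t) / norm.pdf(t)
    ratio = norm.sf(t) / norm.sf(t + x * f_t)   # = U(t + x f(t)) / U(t), should approach e^x
    print(f"t={t}  ratio={ratio:.4f}  e^x={np.exp(x):.4f}")
```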

Definition 4.4: П-variation

A non-decreasing, non-negative function V(x)V(x) defined on (z,)(z, \infty) is П-varying if there exist functions a(t)>0a(t) > 0, b(t)b(t), such that for x>0x > 0:

limtV(tx)b(t)a(t)=lnx\lim \limits_{t \to \infty} \frac{V(tx) - b(t)}{a(t)} = \ln x

Note that one can take b(t)=V(t)b(t) = V(t) because:

V(tx)V(x)a(t)=V(tx)b(t)a(t)V(t)b(t)a(t)lnxln1=lnx\frac{V(tx) - V(x)}{a(t)} = \frac{V(tx) - b(t)}{a(t)} - \frac{V(t) - b(t)}{a(t)} \to \ln x - \ln 1 = \ln x

and similarly one may take a(t)=V(te)V(t)a(t) = V(te) - V(t) because V(te)V(t)V(te)V(t)=lne=1\frac{V(te) - V(t)}{V(te) - V(t)} = \ln e = 1.

The key to proving that conditions of convergence are both necessary and sufficient is the dualism between survival function and its inverse.

Turns out that if \frac{1}{S} is Г-varying, its inverse (\frac{1}{S})^{\leftarrow} is П-varying, which lets us establish that the conditions are not only sufficient, but necessary as well.

Lemma 4.7: If a function is Г-varying, its inverse is П-varying and vice versa

Direct statement

If UU is Г-varying with auxiliary function ff, then UU^{\leftarrow} is П-varying, where auxiliary function a(t)=fU(t)a(t) = f \circ U^{\leftarrow}(t).

Reverse statement

If V is a П-varying function with auxiliary function a, then its inverse V^{\leftarrow} is Г-varying with auxiliary function f(t) = a \circ V^{\leftarrow}(t).

Proof:

Direct statement

Start from the definition of Г-varying function limtU(t+xf(t))U(t)=ex\lim \limits_{t \to \infty} \frac{U(t + x f(t))}{U(t)} = e^x.

Denote y=exy = e^x.

U(t+lnyf(t))txFyU(t)U(t + \ln y \cdot f(t)) \xrightarrow{t \to x_F} y U(t)

Apply the inverse U^{\leftarrow} to both sides:

t+lnyf(t)txFU(yU(t))t + \ln y \cdot f(t) \xrightarrow{t \to x_F} U^{\leftarrow}(y U(t))

Assign tt a value of U(t)U^{\leftarrow}(t), so that now tt \to \infty:

U(t)+lnyf(U(t))tU(yt)U^{\leftarrow}(t) + \ln y \cdot f(U^{\leftarrow}(t)) \xrightarrow{t \to \infty} U^{\leftarrow}(yt)

U(ty)U(t)f(U(t))lny\frac{ U^{\leftarrow}(ty) - U^{\leftarrow}(t) }{ f( U^{\leftarrow}(t) ) } \to \ln y, which is the definition of П-varying function with auxiliary function a=fUa = f \circ U^{\leftarrow}.

Reverse statement

The proof is analogous to that of the direct statement.

Lemma 4.8: Connection between Type I EVD, Г-variation and П-variation

The following statements are equivalent:

  1. F(x)F(x) is of Type I EVD
  2. 1/S(x)1/S(x) is of Г-variation
  3. (1/S(x))(1/S(x))^{\leftarrow} is of П-variation (γ(S)\gamma(S) is a tail quantile function as defined above)

Proof of (1) -> (3)

If \xi \sim F(x) and the distribution of M_n = \max \limits_{i \le n} \xi_i (after normalization) converges to e^{-e^{-x}}, it means that:

F_{M_n}(a_n x + b_n) = F_{\xi}(a_n x + b_n)^n \to e^{-e^{-x}} \leftarrow (1 - e^{-x}/n)^n

Fξ(anx+bn)(1ex/n)F_{\xi}(a_n x + b_n) \to (1 - e^{-x}/n)

n(1Fξ(anx+bn))exn (1 - F_{\xi}(a_n x + b_n)) \to e^{-x}

Sξ(anx+bn)1/(nex)S_{\xi}(a_n x + b_n) \to 1 / (n e^x)

Similarly to the previous lemma denote y=exy = e^{x}:

1Sξ(anlny+bn)=ny\frac{1}{ S_{\xi}(a_n \ln y + b_n) } = n y

Denote U the reciprocal of the survival function, U(x) = \frac{1}{ S_{\xi}(x) }, so that U(a_n \ln y + b_n) = ny

Apply the inverse U^{\leftarrow} to both sides:

anlny+bn=U(ny)a_n \ln y + b_n = U^{\leftarrow}(ny) or

U(ny)bnan=lny\frac{U^{\leftarrow}(ny) - b_n}{a_n} = \ln y

Or, setting y = 1, we get b_n = U^{\leftarrow}(n), and we arrive at the definition of П-variation:

\frac{U^{\leftarrow}(ny) - U^{\leftarrow}(n)}{a_n} = \ln y

Proof of (3) -> (2)

This is the reverse statement of the previous lemma, nothing needs to be done here.

Proof of (2) -> (1)

Given U(x)=1S(x)U(x) = \frac{1}{S(x)} and U(t+xf(t))U(t)ex\frac{U(t + x f(t))}{U(t)} \to e^{x}, we need to show that F(x)eexF(x) \to e^{-e^{-x}}.

Set t=U(n)t = U^{\leftarrow}(n): U(U(n)+xf(U(n)))U(U(n))ex\frac{U(U^{\leftarrow}(n) + x f(U^{\leftarrow}(n)))}{U(U^{\leftarrow}(n))} \to e^{x}, which leads to:

U(U(n)+xf(U(n)))nexU(U^{\leftarrow}(n) + x f(U^{\leftarrow}(n))) \to n e^{x}

Now assume bn=U(n)b_n = U^{\leftarrow}(n), an=f(U(n))a_n = f(U^{\leftarrow}(n)), and we come to U(anx+bn)nexU(a_n x + b_n) \to n e^{x}:

\frac{1}{S(a_n x + b_n)} \to n e^{x}, hence, n S(a_n x + b_n) \to e^{-x} or F(a_n x + b_n) \to 1 - \frac{e^{-x}}{n}.

And finally F_{M_n}(a_n x + b_n) = F^{n}(a_n x + b_n) \approx (1 - \frac{e^{-x}}{n})^n \to e^{-e^{-x}}.

Theorem 4.3. Necessary and sufficient conditions of convergence to Type I EVD

A distribution belongs to Extreme Value Distribution type I (Gumbel) domain of attraction if and only if limtxFS(t+xg(t))S(t)=ex\lim \limits_{t \to x_F} \frac{S(t + x g(t))}{S(t)} = e^{-x}, where xRx \in \mathbb{R} and g(t)g(t) is an auxiliary function (which is not uniquely-defined, but they all are asymptotically equivalent by Khinchin’s lemma) e.g. could be set to inverse hazard rate g(t)=1r(t)=S(t)f(t)g(t) = \frac{1}{r(t)} = \frac{S(t)}{f(t)}.

Proof

Result immediately follows from Lemma 4.8.
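As a quick sanity check of this criterion, the standard exponential distribution satisfies it exactly with g(t) = 1, since S(t + x)/S(t) = e^{-x}; the sketch below (numpy assumed, parameters illustrative) also confirms that its normalized maxima approach the Gumbel distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sim = 1_000, 10_000

# Standard exponential: S(t + x) / S(t) = e^{-x} exactly, auxiliary function g(t) = 1
maxima = rng.exponential(size=(n_sim, n)).max(axis=1)
z = maxima - np.log(n)                     # normalization: a_n = 1, b_n = ln n

x = 1.0
empirical = (z <= x).mean()
gumbel = np.exp(-np.exp(-x))
print(f"P(M_n - ln n <= {x}) = {empirical:.4f}   Gumbel limit: {gumbel:.4f}")
```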

Remember that we've shown that the survival function being a von Mises function is sufficient for its maximum to converge to Gumbel Type I EVD.

Now we can slightly generalize that condition and show that it is necessary and sufficient.

Theorem 4.4. Generalization of Von Mises function

A distribution’s maximum converges to Gumbel Type I EVD, if and only if its survival function can be represented as:

S(x)=c(x)S(x)=c(x)exp(x1f(u)du)S(x) = c(x) S^{*}(x) = c(x) exp(-\int \limits_{-\infty}^{x} \frac{1}{f(u)} du), where SS^{*} is a von Mises function and limxxFc(x)=c>0\lim \limits_{x \to x_F} c(x) = c > 0.

Proof of direct statement

As SS^{*} is a von Mises function, nS(anx+bn)exn S^{*}(a_n x + b_n) \to e^{-x}. Hence,

n c(a_n x + b_n) S^{*}(a_n x + b_n) \to c e^{-x} and

F^n(a_n x + b_n) \to exp(-c e^{-x}), which is still a Type I Gumbel distribution, just shifted in location by \ln c.

Outline of proof of converse statement

This proof is very technical (5 pages of boring and uninstructive math, you’ll never need again), and I won’t give it in full, as these details are basically useless. However, I shall provide the outline of proof structure, as in Resnick book.

By Lemma 4.8 if our cumulative distribution function FD(Gumbel)F \in \mathcal{D}(Gumbel), we know that (1S(x))=V(\frac{1}{S(x)})^{\leftarrow} = V, where VV is П-varying.

Consider VV^{\leftarrow}. It is possible to show that there exist such functions V1V^{\leftarrow}_1 and V2V^{\leftarrow}_2 that limsV(s)V1(s)=e\lim \limits_{s \to \infty} \frac{ V^{\leftarrow}(s) }{ V_1^{\leftarrow}(s) } = e and limsV1(s)V2(s)=e\lim \limits_{s \to \infty} \frac{ V_1^{\leftarrow}(s) }{ V_2^{\leftarrow}(s) } = e.

Hence, S(x)1eS1(x)S(x) \sim \frac{1}{e} S_1(x) and S1(x)1eS2(x)S_1(x) \sim \frac{1}{e} S_2(x), hence, S(x)1e2S2(x)S(x) \sim \frac{1}{e^2} S_2(x).

Surprisingly, it turns out that V2V_2^{\leftarrow} always corresponds to a von Mises survival function S(x)S^*(x): S(x)=1V2S^{*}(x) = \frac{1}{ V_2^{\leftarrow} }.

This leads us to S(x)1e2S(x)S(x) \sim \frac{1}{e^2} S^{*}(x).

Theorem 4.5. Sufficient condition for a distribution not to belong to a domain of attraction of max-stable distributions

If a distribution belongs to the domain of attraction of some type of Extreme Value Distribution, then \frac{S(x)}{S(x-)} \xrightarrow{x \to x_F} 1. Conversely, if this ratio does not converge to 1, the distribution does not belong to the domain of attraction of any max-stable distribution.

Here S(x)S(x-) denotes the left limit of survival function as it approaches xx.

Proof for the case of EVD type I

From the previous theorem's representation of the survival function as a generalized von Mises function (and the continuity of the von Mises factor S^{*}), we immediately get \frac{ S(x) }{ S(x-) } = \frac{ c(x) }{ c(x-) }, and since both numerator and denominator converge to c, this concludes the proof.

Proof in general case

I've seen some ugly and technical proofs of this statement, e.g. in the textbook by Leadbetter et al. I believe Leadbetter's proof contains an arithmetic mistake or a typo (or I made an arithmetic mistake while working through it, which seems more probable, but I can't find it).

Instead of proving it directly, I suggest inferring this result from the Pickands-Balkema-de Haan theorem, proven in the next part of this text. From that theorem we know that the ratio \frac{S(t + \Delta x g(t))}{S(t)} \xrightarrow{t \to x_F} (1 + \gamma \Delta x)^{-1/\gamma} if and only if the survival function is in the domain of attraction of the Generalized Extreme Value Distribution (here g(x) is the auxiliary function).

If we let \Delta x \to 0 and take t = x- (approaching x from the left) and t + \Delta x g(t) = x, we immediately get \frac{S(x)}{S(x-)} \to 1.

Example 4.3: Geometric distribution

For geometric distribution (describing probability of getting kk failures before the first success, upon which the process stops), probability mass function equals P(ξ=k)=(1p)kpP(\xi = k) = (1-p)^k p.

Hence, the cumulative distribution function is F_{\xi}(x) = 1 - p(\xi > x) = 1 - p \sum \limits_{i = [x] + 1}^{\infty} (1 - p)^i =

=1p(1p)([x]+1)(1+(1p)1+(1p)2+...)=1p(1p)([x]+1)11(1p)=1(1p)[x]+1= 1 - p (1 - p)^{([x] + 1)} \cdot (1 + (1-p)^1 + (1-p)^2 + ...) = 1 - p (1 - p)^{([x] + 1)} \cdot \frac{1}{1 - (1 - p)} = 1 - (1-p)^{[x]+1}.

Here [x] denotes the integer part of the number. E.g. if x = 1.5, the probability of having at most 1.5 failures before the first success is the probability of having at most 1 failure before the first success.

Hence, for positive integer xx, Fξ(x)=1(1p)xF_{\xi}(x-) = 1 - (1-p)^{x} and S(x)S(x)=(1p)x+1(1p)x=(1p)1\frac{S(x)}{S(x-)} = \frac{ (1 - p)^{x+1} }{ (1 - p)^x } = (1 - p) \ne 1.

Example 4.4: Poisson distribution

Similarly to the geometric distribution, in case of the Poisson distribution we have the cumulative distribution function F(x) = \sum \limits_{k=0}^{[x]} e^{-\lambda} \frac{\lambda^k}{k!}, where [x] is the floor of x.

Survival function S(x)=1F(x)=k>[x]eλλkk!S(x) = 1 - F(x) = \sum \limits_{k > [x]} e^{-\lambda} \frac{\lambda^k}{k!}.

Hence, for integer xx, S(x)S(x)=k>xλk/k!k>x1λk/k!\frac{S(x)}{S(x-)} = \frac{\sum \limits_{k > x} \lambda^k / k!}{\sum \limits_{k > x - 1} \lambda^k / k!}.

Consider the value \frac{S(x-)}{S(x)} - 1 = \frac{S(x-) - S(x)}{S(x)}, which would have to approach 0 if \frac{S(x)}{S(x-)} \to 1.

\frac{S(x-)}{S(x)} - 1 = \frac{\lambda^x / x!}{ \sum \limits_{k > x} \lambda^k / k! } = \frac{1}{ \sum \limits_{s = 1}^{\infty} \frac{\lambda^s }{ (x + 1) (x + 2) ... (x + s) } } \ge \frac{1}{ \sum \limits_{s = 1}^{\infty} (\frac{\lambda}{x})^s } = \frac{1 - \lambda/x}{\lambda/x}.

As x \to \infty this value approaches infinity, so \frac{S(x)}{S(x-)} \to 0 \ne 1, proving that the Poisson distribution does not belong to the domain of attraction of any EVD.
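A quick numeric confirmation of this computation (scipy assumed, \lambda = 4 chosen arbitrarily): the ratio S(x)/S(x-) for the Poisson distribution drifts towards 0 rather than 1:

```python
from scipy.stats import poisson

lam = 4.0
for x in [10, 20, 40, 80]:
    ratio = poisson.sf(x, lam) / poisson.sf(x - 1, lam)   # S(x) / S(x-) for integer x
    print(f"x={x}  S(x)/S(x-) = {ratio:.4f}")
```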

5. Residual life time

James Pickands III, A.A. (Guus) Balkema and Laurens de Haan

Generalized Pareto distribution

Finally, I want to discuss a very practical application of Extreme Value Theory. Let’s look at the Generalized Extreme Value distribution again:

Gγ(x)=exp((1+γx)1γ)G_{\gamma}(x) = exp(-(1 + \gamma x)^{-\frac{1}{\gamma}}).

The negative of the expression that we find in the exponent can be interpreted as a survival function of a distribution by itself, called the Generalized Pareto Distribution:

S(x)=(1+γx)1γS(x) = (1 + \gamma x)^{-\frac{1}{\gamma}}, F(x)=1(1+γx)1γF(x) = 1 - (1 + \gamma x)^{-\frac{1}{\gamma}}

What meaning can we attribute to it?

Well, first interpretation is straightforward: recall the correspondence between cumulative hazard rate R(t)R(t) and survival function S(t)S(t):

S(t)=eR(t)S(t) = e^{-R(t)}

Hence, if cumulative distribution function of maximum of your random variables is in Generalized Extreme Value distribution, the corresponding cumulative hazard rate is in Generalized Pareto distribution.

  • if γ\gamma is positive, cumulative hazard rate is in the domain of attraction of Type II Frechet EVD, and you can live forever (maybe the chances are slim, but in theory it is possible);
  • if γ=0\gamma = 0, cumulative hazard rate is in the domain of attraction of Type I Gumbel EVD, and it can go either way (as both bounded and unbounded distributions converge to Type I Gumbel EVD);
  • if γ\gamma is negative, cumulative hazard rate is in the domain of attraction of Type III Inverse Weibull EVD, and the right end point is finite.
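The three regimes listed above can be seen directly from the Generalized Pareto survival function; the following minimal sketch (numpy assumed, unit scale, the \gamma values chosen just for illustration) evaluates it for a heavy tail, an exponential tail and a finite right end point:

```python
import numpy as np

def gpd_survival(x, gamma):
    """Survival function of the Generalized Pareto distribution (unit scale)."""
    if gamma == 0.0:
        return np.exp(-x)
    return np.where(1.0 + gamma * x > 0, (1.0 + gamma * x)**(-1.0 / gamma), 0.0)

x = np.array([0.5, 1.0, 2.0, 5.0])
for gamma in [0.5, 0.0, -0.5]:   # heavy tail / exponential tail / finite end point at 1/|gamma|
    print(f"gamma={gamma:+.1f}  S(x) = {np.round(gpd_survival(x, gamma), 4)}")
```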

Residual life time problem

However, there is another, maybe even more interesting use case for Generalized Pareto distribution. Recall that while proving conditions for Type I distribution (both von Mises and necessary and sufficient) we came up with a conditional probability construction:

p(ξug(u)>xξ>u)p( \frac{\xi - u}{g(u)} > x | \xi > u)

Here g(u)g(u) was an auxiliary function (all the auxiliary functions are asymptotically equivalent as uu \to \infty by Khinchin’s theorem) and ξ\xi was our random variable, and we were striving to show that if uu is close to the right end point xFx_F, the following holds:

S(u+xg(u))S(u)uxFex\frac{S(u + x g(u))}{S(u)} \xrightarrow{u \to x_F} e^{-x}

In other words, if you’ve reached the age uu, close enough to the right end point of our survival function SξS_\xi, your chances to survive xx extra years, normalized by auxiliary function g(u)g(u), are Generalized Pareto-distributed.

Turns out, a similar statement holds for other types of EVD: if your survival function lies in the domain of attraction of some type of EVD, then corresponding residual life time converges to Generalized Pareto Distribution. Moreover, it turns out that this statement works in both directions: if residual life time distribution is Generalized Pareto, the survival function is in domain of attraction of generalized EVD. This fact can be proved as a theorem, known as Pickands-Balkema-de Haan theorem (sometimes called Second Extreme Value Theorem).
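Here is a minimal peaks-over-threshold sketch of this statement for the classical Pareto distribution with \alpha = 2 (so \gamma = 1/\alpha = 0.5); the threshold, the sample size and the auxiliary function g(u) = u/\alpha (the inverse hazard rate) are illustrative assumptions, and numpy is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.0
gamma = 1.0 / alpha

xi = rng.pareto(alpha, size=1_000_000) + 1.0     # classical Pareto: S(x) = x**(-alpha) on [1, inf)
u = np.quantile(xi, 0.999)                       # a high threshold
excess = xi[xi > u] - u                          # residual life times above the threshold

g_u = u / alpha                                  # auxiliary function ~ inverse hazard rate
y = 1.0
empirical = (excess / g_u > y).mean()
gpd = (1.0 + gamma * y)**(-1.0 / gamma)
print(f"threshold u={u:.2f}, exceedances={excess.size}")
print(f"P(excess / g(u) > {y}) = {empirical:.4f}   GPD survival: {gpd:.4f}")
```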

Pickands-Balkema-de Haan theorem (a.k.a. Second Extreme Value Theorem)

Theorem 5.1. Pickands-Balkema-de Haan theorem (Second Extreme Value Theorem)

Normalized residual life time distribution \frac{ S(t + x g(t)) } {S(t)} \xrightarrow{t \to x_F} G(x), where G(x) = \begin{cases} (1 + \gamma x)^{-1/\gamma}, & \gamma \ne 0 \\ e^{-x}, & \gamma = 0 \end{cases}, g(t) is an auxiliary function used for normalization, and the convergence holds if and only if the underlying distribution is in the domain of attraction of the Generalized EVD with the same \gamma.

Note: moreover, we can generalize even further and relax the conditions to S(b(t) + x g(t)) \xrightarrow{t \to \infty} G(x); then the set of limit distributions is slightly extended. I am not going to go into details, see the Balkema, de Haan 1974 paper.

Proof of direct statement:

Type I (Gumbel)

By necessary and sufficient condition of survival function attraction to Gumbel Type I EVD we know that:

S(u+xg(u))S(u)uxFex\frac{S(u + x g(u))}{S(u)} \xrightarrow{u \to x_F} e^{-x}

Note that exe^{-x} is a special case of generalized Pareto distribution upon γ0\gamma \to 0: (1+γx)1γγ0ex(1 + \gamma x)^{-\frac{1}{\gamma}} \xrightarrow{\gamma \to 0} e^{-x}.

Type II (Frechet)

This proof is based on a proof sketch from Embrechts textbook, appendix A3.3 and Theorem 3.4.5 and Theorem 6 from Balkema, de Haan 1974 paper with omissions filled in.

By the necessary and sufficient conditions we know that \frac{S(tx)}{S(t)} \xrightarrow{t \to \infty} x^{-\alpha}. Hence, S(x) is a regularly varying function with index of variation -\alpha.

By Karamata's characterization theorem: S(x) = c(x) e^{-\int \limits_1^x \frac{\alpha(t)}{t} dt}, where \alpha(t) \xrightarrow{ t \to \infty } \alpha, c(t) \xrightarrow{ t \to \infty } const.

Hence, \frac{S(u + x g(u))}{S(u)} = \frac{ c(u + x g(u)) \cdot e^{ - \int \limits_1^{u + x g(u)} \frac{\alpha(t)}{t} dt } } { c(u) \cdot e^{ - \int \limits_1^u \frac{ \alpha(t) }{t} dt } } \approx e^{ - \int \limits_u^{u + x g(u)} \frac{\alpha(t)}{t} dt } \approx e^{ - \alpha (\ln (u + x g(u)) - \ln u)} = (\frac{u + x g(u)}{ u })^{-\alpha} as u \to \infty.

Auxiliary function g(u)g(u) is tail-equivalent to 1h(u)\frac{1}{h(u)} by Khinchin’s lemma, where h(u)h(u) is hazard rate.

Hazard rate is the ratio of probability density and survival function, and we know that survival function is regularly varying.

Hence, by Karamata’s theorem ratio statement xh(x)=xS(x)S(x)(α)=αx h(x) = \frac{ - x S'(x)}{S(x)} \to - (-\alpha) = \alpha.

Thus \frac{u}{g(u)} \to \alpha, and we get \frac{S(u + x g(u))}{S(u)} \xrightarrow{u \to x_F} \left(\frac{u + x g(u)}{u}\right)^{-\alpha} \to (1 + x/\alpha)^{-\alpha}.

Denoting γ=1α\gamma = \frac{1}{\alpha} we get the desired result S(u+xg(u))S(u)u(1+γx)1γ\frac{S(u + x g(u))}{S(u)} \xrightarrow{u \to \infty} (1 + \gamma x)^{-\frac{1}{\gamma}}.
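As a quick numeric illustration of the Type II case (my own sketch, with an arbitrary α): for a pure Pareto tail S(x) = x^{-α} the auxiliary function can be taken as g(u) = u/α, and the ratio S(u + x g(u))/S(u) coincides with (1 + x/α)^{-α} = (1 + γx)^{-1/γ}.

import numpy as np

alpha = 2.5
S = lambda x: x ** (-alpha)          # survival function of a pure Pareto tail
u = 1e6                              # "large" threshold
x = np.linspace(0.0, 5.0, 6)

lhs = S(u + x * (u / alpha)) / S(u)  # normalized residual life time
rhs = (1 + x / alpha) ** (-alpha)    # Generalized Pareto with gamma = 1/alpha
print(np.max(np.abs(lhs - rhs)))     # ~0: for a pure Pareto tail the relation is exact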

Type III (Inverse Weibull)

This version of proof is similar to Type II, but with appropriate substitution of Karamata’s characterization of survival function for Type II with its Type III analogue from Resnick’s textbook corollary 1.14.

The proof consists of two steps.

The first step is to show that if xF<x_F < \infty survival function has the following representation S(x)=c(x)exF1xβ(t)xFtdtS(x) = c(x) e^{- \int \limits_{x_F - 1}^x \frac{ \beta(t) }{ x_F - t } dt }, where β(t)txFβ\beta(t) \xrightarrow{t \to x_F} \beta, c(t)txFconstc(t) \xrightarrow{t \to x_F} const.

From necessary and sufficient conditions of Type III EVD domain of attraction:

S(xF1tx)S(xF1t)txβ\frac{ S(x_F - \frac{1}{tx}) }{ S(x_F - \frac{1}{t}) } \xrightarrow{t \to \infty} x^{-\beta}

Hence, by Karamata’s characterization theorem: S(x_F - \frac{1}{x}) = \tilde{c}(x) e^{-\int \limits_1^{x} \frac{\tilde{\beta}(t)}{t} dt}, where \tilde{\beta}(t) \xrightarrow{t \to \infty} \beta and \tilde{c}(t) \xrightarrow{t \to \infty} const.

Denote y=xF1xy = x_F - \frac{1}{x}, so that x=1xFyx = \frac{1}{ x_F - y }. Then:

S(y)=c~(1xFy)e11xFyβ~(t)tdtS(y) = \tilde{c}(\frac{1}{x_F - y}) e^{- \int \limits_{1}^{\frac{1}{x_F - y}} \frac{ \tilde{\beta}(t) }{ t } dt }

Changing variables in the integral to s=xF1ts = x_F - \frac{1}{t} results in:

S(y)=c~(1xFy)exF1yβ~(1xFs)xFsds=c(y)exF1yβ(s)xFsdsS(y) = \tilde{c}( \frac{1}{x_F - y} ) e^{- \int \limits_{x_F - 1}^{y} \frac{ \tilde{\beta}( \frac{1}{x_F - s} ) }{ x_F - s } ds } = c(y) e^{- \int \limits_{x_F - 1}^{y} \frac{\beta(s)}{x_F - s} ds}

Now the second step of the proof is to apply this representation of survival function:

S(u+xg(u))S(u)uxFeuu+xg(u)β(t)xFtdt=eβ(ln(xF(u+xg(u)))ln(xFu))=(xFuxg(u)xFu)β=(1xg(u)xFu)β\frac{S(u + x g(u))}{S(u)} \xrightarrow{u \to x_F} e^{-\int \limits_{u}^{u + x g(u)} \frac{\beta(t)}{x_F - t} dt } = e^{\beta (\ln (x_F - (u + x g(u))) - \ln (x_F - u)) } = (\frac{ x_F - u - xg(u) } { x_F - u } )^\beta = (1 - \frac{x g(u)}{ x_F - u } )^{\beta}.

The last thing left to show is that \frac{g(u)}{x_F - u} \to \frac{1}{\beta}. This follows from the fact that the auxiliary function g(u) is the reciprocal of the hazard rate, and the fact that S(x_F - \frac{1}{x}) is a regularly varying function of x.

Denoting γ=1β\gamma = -\frac{1}{\beta}, we obtain the desired result.
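A similar sketch for the Type III case (again mine, not from the sources): for the uniform distribution on [0, 1] we have x_F = 1, β = 1 and g(u) = (x_F - u)/β, and the normalized residual life time is exactly 1 - x, i.e. a Generalized Pareto survival function with γ = -1/β = -1.

import numpy as np

beta = 1.0
S = lambda x: 1.0 - x                # survival function of U[0, 1]
u = 0.999                            # threshold close to x_F = 1
x = np.linspace(0.0, 1.0, 5)

g_u = (1.0 - u) / beta               # auxiliary function g(u) = (x_F - u) / beta
print(S(u + x * g_u) / S(u))         # equals 1 - x = (1 + gamma * x)^(-1/gamma) with gamma = -1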

Outline of the proof of converse statement:

There are 2 proofs of converse statement: in the original 1974 Balkema, de Haan paper and 1975 Pickands paper.

Both proofs are completely unintelligible to me.

Instead of an exact proof, I’d just stick with the fact that if we know that:

S(u+xg(u))S(u)u(1+γx)1/γ\frac{S(u + x g(u))}{S(u)} \xrightarrow{u \to \infty} (1 + \gamma x)^{-1 / \gamma}

We know that all valid auxiliary functions g(u) are asymptotically equivalent by Khinchin’s lemma. Then we can use the fact that we already know that g^*(u) = \frac{1 - F(u)}{F'(u)} (the reciprocal of the hazard rate) fits, and our g(u) \sim g^*(u) by Khinchin’s lemma as u \to x_F.

This allows us to reverse the discussion in the direct statement for each distribution type and end up with necessary and sufficient conditions of convergence to EVD.

6. Order statistics and parameter estimation

Extreme Value Theorem provides us with the distribution of maximum.

However, as an almost free bonus, we can also infer the distribution of the 2nd largest, the 3rd largest and, generally, the k-th largest value from EVT and the Poisson distribution. We’ll need just a few basic results from the order statistics toolbox in order to infer the asymptotic distribution of the k-th maximum.

In the second part of this chapter we will also use this toolbox to study two of the best-known estimators of EVD parameters, the Hill and Pickands estimators, which are based on order statistics.

Order statistics

Definition 6.1. Order statistics

Given a set of samples \{ \xi_i \} from an arbitrary distribution \xi, one can re-order them in increasing order. In the resulting ordered list \{ \xi_{(k)} \}, such that \xi_{(k)} \le \xi_{(k+1)}, the element \xi_{(k)} is called the k-th order statistic.

Basically, the k-th smallest value is called k-th order statistic.

Proposition 6.1. Distribution of k-th order statistics

Cumulative distribution function of the k-th order statistic of a sample of size n from distribution \xi is F_{ \xi_{(k)} }(x) = \sum \limits_{j=k}^{n} C_n^j F_{\xi}(x)^j (1 - F_{\xi}(x))^{n-j}.

Proof:

The event \{ \xi_{(k)} \le x \} means that at least k of the n points are no greater than x.

The probability that exactly j points are no greater than x and the remaining n - j points are greater than x is C_n^j F_\xi(x)^j (1 - F_\xi(x))^{n-j}: exactly which points fall into each category does not matter, hence the binomial coefficient C_n^j.

Summing over j = k, ..., n, we get F_{\xi_{(k)}}(x) = \sum \limits_{j=k}^{n} C_n^j F_\xi(x)^j (1 - F_\xi(x))^{n-j}.

Theorem 6.1. Poisson-like distribution of k-th order statistics from Extreme Value Distribution

For a distribution in the domain of attraction of the generalized EVD G(x), the number of sample points exceeding the level a_n x + b_n is asymptotically Poisson: the probability that exactly k out of n observations exceed a_n x + b_n converges to

G(x) \frac{(-\ln G(x))^k}{k!},

and, consequently, the cumulative distribution function of the (k+1)-th maximum converges to the sum of such terms over 0, ..., k.

Proof:

I am following the logic of Smith, Weissman book or Leadbetter, Lindgren, Rootzen book.

nS(a_n x + b_n) \xrightarrow{n \to \infty} -\ln G(x), where G(x) is the generalized EVD.

The number of observations exceeding the level a_n x + b_n is binomially distributed with n trials and success probability S(a_n x + b_n), and a binomial distribution with a small success probability converges to a Poisson distribution as n \to \infty:

C_n^k (1 - F_\xi(a_n x + b_n))^k F_\xi(a_n x + b_n)^{n - k} \xrightarrow{n \to \infty} e^{-\lambda} \frac{\lambda^k}{k!}, where \lambda = \lim \limits_{n \to \infty} n S(a_n x + b_n) = -\ln G(x).

Hence, the probability of exactly k exceedances tends to e^{\ln G(x)} \frac{(-\ln G(x))^k}{k!} = G(x) \frac{(-\ln G(x))^k}{k!}, and the cumulative distribution function of the (k+1)-th maximum, i.e. the probability of at most k exceedances, tends to G(x) \sum \limits_{j=0}^{k} \frac{(-\ln G(x))^j}{j!}.
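A small simulation sketch of this Poisson limit (numbers picked arbitrarily): for Exp(1) samples with a_n = 1, b_n = ln n, the number of observations exceeding x + ln n is approximately Poisson with rate e^{-x} = -ln G(x), where G is the standard Gumbel cdf.

import math
import numpy as np

rng = np.random.default_rng(1)
n, trials, x, k = 1000, 10_000, 0.5, 2
level = x + math.log(n)

# number of exceedances of the level a_n * x + b_n in each trial
counts = (rng.exponential(size=(trials, n)) > level).sum(axis=1)
empirical = np.mean(counts == k)

G = math.exp(-math.exp(-x))                       # Gumbel cdf at x
theoretical = G * (-math.log(G)) ** k / math.factorial(k)
print(empirical, theoretical)                     # should be close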

Hill’s estimator

Hill’s estimator is based on order statistics and is derived from Renyi’s representation theorem.

It is applicable only in Frechet case γ>0\gamma > 0. In case of Gumbel and Inverse Weibull it returns nonsense results. However, there are similar estimators for them.

Hill’s estimator provides an estimate of the tail index: \hat{\gamma} = \frac{1}{k} \sum \limits_{i=1}^{k} (\ln x_{(i)} - \ln x_{(k+1)}), where x_{(1)} \ge x_{(2)} \ge ... are the observations sorted in descending order (upper order statistics) and k is the number of upper order statistics used.

Derivation of Hill’s estimator is based on transformation theorem for probability densities and Renyi’s representation theorem. Let’s prove them first.

Proposition 6.2. Transformation theorem for probability densities

Given a density function f(x1,x2,...,xn)f(x_1, x_2, ..., x_n) and a transformation of variables T:(x1,x2,...,xn)(y1,y2,...,yn)T: (x_1, x_2, ..., x_n) \to (y_1, y_2, ..., y_n),

we can calculate the probability of any event in new variables as:

D(x1,x2,...,xn)f(x1,x2,...,xn)dx1dx2...dxn=D(y1,y2,...,yn)Jf(y1,y2,...,yn)dy1dy2...dyn\iint \limits_{D(x_1, x_2, ..., x_n)} f(x_1, x_2, ..., x_n) dx_1 dx_2 ... dx_n = \iint \limits_{D(y_1, y_2, ..., y_n)} |J| \cdot f(y_1, y_2, ..., y_n) dy_1 dy_2 ... dy_n,

where J|J| is a Jacobian determinant det(x1y1x2y1...xny1x1y2x2y2...xny2............x1ynx2yn...xnyn) \det \begin{pmatrix} \frac{\partial{x_1}}{\partial{y_1}} && \frac{\partial{x_2}}{\partial{y_1}} && ... && \frac{\partial{x_n}}{\partial{y_1}} \\ \frac{\partial{x_1}}{\partial{y_2}} && \frac{\partial{x_2}}{\partial{y_2}} && ... && \frac{\partial{x_n}}{\partial{y_2}} \\ ... && ... && ... && ... \\ \frac{\partial{x_1}}{\partial{y_n}} && \frac{\partial{x_2}}{\partial{y_n}} && ... && \frac{\partial{x_n}}{\partial{y_n}} \end{pmatrix}.

This result is obvious, basically.

Next we go to the Renyi representation theorem, importance of which is well stated in Robert Reitano’s book, page 55:

The essence of this representation theorem is that each of the k-th order statistics, or any sequential grouping of kth order statistics, can be generated sequentially as a sum of independent exponential random variables. This is in contrast to the definitional procedure whereby the entire collection {Xj}j=1M\{X_j\}_{j=1}^M must be generated, then reordered to {X(j)}j=1M\{ X_{(j)} \}_{j=1}^M to identify each variate. To generate a larger collection requires more variates, {Xj}j=1M+1\{X_{j}\}_{j=1}^{M+1} and then a complete reordering of the entire collection {X(j)}j=1M+1\{X_{(j)}\}_{j=1}^{M+1}

With Renyi’s result we can directly generate X(1)X_{(1)}, then X(2)X_{(2)} and so forth. For example, X(1)X_{(1)} is exponential with parameter MλM\lambda; while X(2)X_{(2)} is a sum of X(1)X_{(1)} and an independent exponential variate with parameter (M1)λ(M - 1)\lambda; and so forth.

Lemma 6.1: Renyi representation theorem

Let \{\xi_n\} be a sample of size n from an exponential distribution with rate parameter \lambda.

Let {ξ(n)}\{\xi_{(n)}\} be corresponding order statistic, so that ξ(1)<ξ(2)\xi_{(1)} < \xi_{(2)}, ξ(2)<ξ(3)\xi_{(2)} < \xi_{(3)}, …, ξ(n1)<ξ(n)\xi_{(n-1)} < \xi_{(n)}.

Then order statistics can be represented as:

\xi_{(i)} = \frac{1}{\lambda} (\frac{E_1}{n} + \frac{E_2}{n-1} + ... + \frac{E_i}{n-i+1}) = \frac{1}{\lambda} \sum\limits_{m=1}^i \frac{E_m}{n - m + 1}, where E_i are independent standard exponentially-distributed variables (with rate 1).

Proof:

I am following the logic of the proof from Reitano textbook, proposition 2.2.3, it makes use of cumulative distribution functions. Alternatively, one can follow Embrechts (1997) textbook, Example 4.1.5, it makes use of density function.

Consider the probability density of obtaining the given order statistic (it can be obtained in n!n! permutations of the unordered set of observed points {x1,...,xn}\{x_1, ..., x_n\} ):

f_{\{ \xi_{(1)}, ..., \xi_{(n)} \}} = n! \cdot \lambda^n e^{-\lambda \sum \limits_{i=1}^{n} x_i}

Now consider the gaps between consecutive order statistics: \xi_{(1)}, (\xi_{(2)} - \xi_{(1)}), …, (\xi_{(n)} - \xi_{(n-1)}); we will show that, up to factors (n - i + 1), they are independent exponential variables.

p(\xi_{(1)} \le a_1, \xi_{(2)} - \xi_{(1)} \le a_2, ..., \xi_{(n)} - \xi_{(n-1)} \le a_n) = n! \int \limits_{0}^{a_1} f(x_1) \int \limits_{x_1}^{x_1 + a_2} f(x_2) \dots \int \limits_{x_{n-1}}^{x_{n-1} + a_n} f(x_n) \, dx_n \dots dx_2 \, dx_1

Start with the innermost integral (over x_n):

xn1xn1+anf(xn)dxn=F(xn1+an)F(xn1)\int \limits_{x_{n-1}}^{x_{n-1} + a_n}f(x_n)dx_n = F(x_{n-1} + a_n) - F(x_{n-1})

(F(xn1+an)F(xn1))f(xn1)dxn1=(1eλ(xn1+an)1+eλxn1)λeλxn1=(1eλan)λe2λxn1=12(1eλan)f2(xn1)(F(x_{n-1} + a_n) - F(x_{n-1})) f(x_{n-1})dx_{n-1} = (1 - e^{-\lambda (x_{n-1} + a_n)} - 1 + e^{-\lambda x_{n-1}}) \cdot \lambda e^{-\lambda x_{n-1}} = (1 - e^{-\lambda a_n}) \lambda e^{-2 \lambda x_{n-1}} = \frac{1}{2} (1 - e^{-\lambda a_n}) f_2(x_{n-1}),

where f2(x)=2λe2λxf_2(x) = 2 \lambda e^{-2 \lambda x} is the probability density of exponential distribution with a parameter 2λ2\lambda.

Repeat this procedure with the next integral:

(1eλan)xn2xn2+an112f2(xn1)dxn1f(xn2)dxn2=12(1eλan)(1e2λ(xn2+an1)1+e2λxn2)f(xn2)dxn2=12(1eλan)(1e2λan1)e2λxn2(λeλxn2)dxn2=13!(1eλan)(1e2λan1)f3(xn2)dxn2(1 - e^{-\lambda a_n}) \int \limits_{x_{n-2}}^{x_{n-2} + a_{n-1}} \frac{1}{2} f_2(x_{n-1}) dx_{n-1} f(x_{n-2}) dx_{n-2} = \frac{1}{2} (1 - e^{-\lambda a_n}) (1 - e^{-2\lambda (x_{n-2} + a_{n-1})} - 1 + e^{-2\lambda x_{n-2}}) f(x_{n-2}) dx_{n-2} = \frac{1}{2} (1 - e^{-\lambda a_n}) (1 - e^{-2\lambda a_{n-1}}) e^{-2 \lambda x_{n-2}} (\lambda e^{-\lambda x_{n-2}}) dx_{n-2} = \frac{1}{3!} (1 - e^{-\lambda a_n}) (1 - e^{-2\lambda a_{n-1}}) f_3(x_{n-2}) dx_{n-2}

Repeating this procedure for the full integral, we get:

p(\xi_{(1)} \le a_1, \xi_{(2)} - \xi_{(1)} \le a_2, ..., \xi_{(n)} - \xi_{(n-1)} \le a_n) = n! \cdot \frac{1}{n!} \prod\limits_{i=1}^{n-1} (1 - e^{-i \lambda a_{n-i+1}}) \int \limits_{0}^{a_1} f_n(x_1) dx_1 = \prod\limits_{i=1}^{n} (1 - e^{-i \lambda a_{n-i+1}}) = \prod\limits_{i=1}^{n} F_i(a_{n-i+1}), where f_n(x) = n \lambda e^{-n \lambda x} and F_i is the cumulative distribution function of an exponential distribution with rate i \lambda.

We see that the gaps are independent, and the i-th gap \xi_{(i)} - \xi_{(i-1)} has an exponential distribution with rate parameter (n-i+1)\lambda.

Equivalently, if X is a standard exponential random variable (rate parameter 1), then E = \frac{X}{\lambda} is exponentially distributed with rate \lambda, because:

f_E(y) = \frac{d}{dy} P(X/\lambda \le y) = \frac{d}{dy} P(X \le \lambda y) = \lambda f_X(\lambda y) = \lambda e^{-\lambda y}

Hence, \xi_{(1)} = \frac{1}{\lambda} \frac{E_1}{n}, \xi_{(2)} = \frac{1}{\lambda} (\frac{E_1}{n} + \frac{E_2}{n-1}) and so on, where the E_i are independent standard exponential variables.
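Here is a small sketch checking the Renyi representation numerically (my own illustration; λ and n are arbitrary): we generate exponential order statistics both by sorting raw samples and via the representation above, and compare, say, the mean of the 5th order statistic.

import numpy as np

rng = np.random.default_rng(2)
lam, n, trials = 2.0, 10, 100_000

# order statistics obtained by sorting raw exponential samples
sorted_way = np.sort(rng.exponential(scale=1 / lam, size=(trials, n)), axis=1)

# order statistics built from the Renyi representation
E = rng.exponential(size=(trials, n))              # standard exponentials
weights = 1.0 / (n - np.arange(n))                 # 1/n, 1/(n-1), ..., 1/1
renyi_way = np.cumsum(E * weights, axis=1) / lam

print(sorted_way[:, 4].mean(), renyi_way[:, 4].mean())   # should be close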

Definition 6.2. Upper order statistics

In case of Hill’s estimator we are studying the upper tail of survival function.

Hence, we introduce a counterpart of k-th order statistic, but decreasing, which we call k-th upper order statistic.

In Hill’s estimator we denote x(k)x_{(k)} k-th upper order statistic the k-th largest value, as we’re looking at the upper tail of survival function. So, x(1)>x(2)>...>x(k)x_{(1)} > x_{(2)} > ... > x_{(k)}.

Theorem 6.2: Hill’s estimator is a Maximum Likelihood estimator in Frechet EVD Type II case

If the survival function of a distribution \xi lies in the domain of attraction of the Frechet Type II EVD, and we take the n largest observations (i.e. the upper order statistics), with n \to \infty, then the shape parameter \gamma of the EVD can be estimated as:

\hat{\gamma} = \frac{1}{n} \sum \limits_{i=1}^{n} (\ln \xi_{(i)} - \ln \xi_{(n+1)})

Proof:

I am following the monograph on Hill’s estimator by Jan Henning Volmer and Tsourti textbook chapter 4.

Step 1: connect order statistic from Type II GPD with exponential distribution through quantile transform

Out of thin air consider 1-st order statistic X(1)X_{(1)} of a standard exponential distribution. From Renyi representation:

X(1)=E1nX_{(1)} = \frac{E_1}{n}

Apply a quantile transform to this expression, taking the inverse of c.d.f. of exponential distribution:

FX(X(1))=1eX(1)U(1)F_X(X_{(1)}) = 1 - e^{-X_{(1)}} \sim U_{(1)}, where U(1)U_{(1)} is a 1-st order statistic from a uniform distribution.

FX(U(1))=X(1)=ln(1U(1))F^{\leftarrow}_X(U_{(1)}) = X_{(1)} = - \ln (1 - U_{(1)})

Now, consider our order statistic from Type II Generalized Pareto Distribution: {ξ(1),...,ξ(n)}\{\xi_{(1)}, ..., \xi_{(n)}\}.

We can apply quantile transform to our order statistics as well: Fξ(ξ(i))U(i)F_\xi(\xi_{(i)}) \sim U_{(i)}, which converts the data from (0,+)(0, +\infty) domain to [0,1] probabilities domain, resulting in an order statistic, sampled from uniform distribution.

So, through the quantile transformation and uniform distribution we are able to connect Type II GPD to exponential distribution:

E_1 = -n \ln S_\xi(\xi_{(1)}) = -n \ln (1 - F_\xi(\xi_{(1)}))

Similarly, for other order statistics:

F_\xi(\xi_{(i)}) = 1 - e^{-X_{(i)}} = 1 - e^{-\sum \limits_{m=1}^{i} \frac{E_m}{n - m + 1}}

\ln (1 - F_\xi(\xi_{(i)})) = \ln S_\xi(\xi_{(i)}) = -\sum \limits_{m=1}^{i} \frac{E_m}{n - m + 1}

\ln (1 - F_\xi(\xi_{(i-1)})) = \ln S_\xi(\xi_{(i-1)}) = -\sum \limits_{m=1}^{i-1} \frac{E_m}{n - m + 1}

Ei=(ni+1)(lnSξ(ξ(i))lnSξ(ξ(i1)))E_i = - (n - i + 1) ( \ln S_\xi(\xi_{(i)}) - \ln S_\xi(\xi_{(i-1)}) )

Step 2: define likelihood, apply transformation and Jacobian

Let the likelihood function be L=p(ξ(1),ξ(2),...,ξ(n))L = p(\xi_{(1)}, \xi_{(2)}, ..., \xi_{(n)}).

In this proof we will twice substitute the variables and express the likelihood through the new ones. First, let us express the likelihood through E_{(i)} rather than \xi_{(i)}, applying the transformation theorem for probability densities:

L = p(\xi_{(1)}, \xi_{(2)}, ..., \xi_{(n)}) = |J| \cdot p(E_{(1)}, E_{(2)}, ..., E_{(n)}) = |J| \cdot e^{-(E_{(1)} + E_{(2)} + ... + E_{(n)})} = |J| \cdot e^{n \ln S_\xi(\xi_{(1)}) + \sum \limits_{i=2}^{n} (n - i + 1)(\ln S_\xi(\xi_{(i)}) - \ln S_\xi(\xi_{(i-1)}))}, where |J| is the Jacobian determinant.

Step 3: substitute Karamata’s representation of survival function

From Karamata’s representation theorem and the necessary and sufficient conditions of convergence to Type II EVD, we know that for large x the survival function can be approximated as S(x) \approx c x^{-\alpha}, because S(x) = c(x) e^{-\int \limits_1^{x} \frac{\alpha(t)}{t} dt}, where \alpha(t) \xrightarrow{t \to \infty} \alpha and c(t) \xrightarrow{t \to \infty} c.

Then consider each factor of the likelihood product L \propto e^{n \ln S_\xi(\xi_{(1)}) + \sum \limits_{i=2}^{n} (n - i + 1)(\ln S_\xi(\xi_{(i)}) - \ln S_\xi(\xi_{(i-1)}))}.

For i=1i = 1:

e^{n \ln S_\xi(\xi_{(1)})} \approx e^{n \ln (c \xi_{(1)}^{-\alpha})} \propto e^{-n \alpha \ln \xi_{(1)}}

For i>1i > 1:

e^{(n-i+1)(\ln S_\xi(\xi_{(i)}) - \ln S_\xi(\xi_{(i-1)}))} = e^{(n-i+1) \ln \frac{c \xi_{(i)}^{-\alpha}}{c \xi_{(i-1)}^{-\alpha}}} = e^{-(n-i+1) \alpha (\ln \xi_{(i)} - \ln \xi_{(i-1)})}.

Here we make a second variable substitution. Define gap between logarithms T(i)={nlnξ(i),i=1(ni+1)(lnξ(i)lnξ(i1)),i>1T_{(i)} = \begin{cases} n \ln \xi_{(i)}, i = 1 \\ (n - i + 1) (\ln \xi_{(i)} - \ln \xi_{(i-1)}), i > 1 \end{cases}

Then the likelihood takes the form: L = p(T_{(1)}, T_{(2)}, ..., T_{(n)}) = |J| \cdot e^{-\alpha \sum \limits_{i=1}^{n} T_{(i)}}, where |J| is now the Jacobian of the transformation from the gaps T_{(i)} to the standard exponential variables E_i (since E_i = \alpha T_{(i)} here); the Jacobian of the transformation from \xi_{(i)} to T_{(i)} does not depend on \alpha and can be dropped for the purpose of maximization.

Step 4: Calculate Jacobian determinant

|J| = \det \frac{\partial (E_1, ..., E_n)}{\partial (T_{(1)}, ..., T_{(n)})} = \det \begin{pmatrix} \alpha && 0 && ... && 0 \\ 0 && \alpha && ... && 0 \\ ... && ... && ... && ... \\ 0 && 0 && ... && \alpha \end{pmatrix} = \alpha^n

Then likelihood L=Jeαi=1nT(i)=αneαi=1nT(i)L = |J| e^{-\alpha \sum \limits_{i=1}^{n} T_{(i)} } = \alpha^n e^{-\alpha \sum \limits_{i=1}^{n} T_{(i)} }

Step 5: find maximum of log-likelihood

Now, let us find the maximum of log-likelihood:

lnL=ln(αneαi=1nT(i))=nlnααi=1nT(i)\ln L = \ln (\alpha^n e^{-\alpha \sum \limits_{i=1}^{n} T_{(i)} } ) = n \ln \alpha - \alpha \sum \limits_{i=1}^{n} T_{(i)}.

lnLα=nαi=1nT(i)=0\frac{\partial \ln L}{\partial \alpha} = \frac{n}{\alpha} - \sum \limits_{i=1}^{n} T_{(i)} = 0

α=n/i=1nT(i)\alpha = n / \sum \limits_{i=1}^{n} T_{(i)}

Now recall that \frac{1}{\gamma} = \alpha, so that \hat{\gamma} = \frac{1}{n} \sum \limits_{i=1}^{n} T_{(i)} = \frac{1}{n} (n \ln \xi_{(1)} + \sum \limits_{i=2}^{n} (n - i + 1)(\ln \xi_{(i)} - \ln \xi_{(i-1)})). Rearranging this telescoping sum and measuring all the logarithms relative to the threshold order statistic \xi_{(n+1)} (the one right below the n largest observations used), we arrive at the form from the theorem statement: \hat{\gamma} = \frac{1}{n} \sum \limits_{i=1}^{n} (\ln \xi_{(i)} - \ln \xi_{(n+1)}).

Here is a python implementation of Hill’s estimator:

import numpy as np


def Hill_estimator(data, k):
    """
    Returns the Hill estimator of the tail index gamma for a 1D data set.

    `data` is assumed to be sorted in descending order (upper order statistics first);
    the k-th largest value data[k-1] is used as the reference threshold.

    See: https://github.com/alinasode/hill-estimator/blob/main/utils/functions.py
    """
    n = len(data)
    assert 0 < k < n

    total = 0.0
    for i in range(k):   # i = 0, ..., k-1
        total += np.log(data[i]) - np.log(data[k - 1])

    gamma = (1 / k) * total

    return gamma
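A usage sketch on synthetic data (it assumes the Hill_estimator function above; the Pareto parameters are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
alpha = 3.0
sample = np.sort(rng.pareto(alpha, size=100_000) + 1.0)[::-1]   # sorted in descending order

print(Hill_estimator(sample, k=2000))    # should be close to gamma = 1/alpha ~ 0.33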

Pickands’ estimator

If we find a way to estimate the parameter γ\gamma from the experimental data, we can argue that underlying survival function has a finite or an infinite upper endpoint. This is extremely valuable, e.g. we can prove that human lifespan is fundamentally limited, and that hazard rate has a singularity, approaching some end point at around 117-123 years.

How do we estimate γ\gamma, given experimental data?

Pickands in his 1975 paper suggested an estimator based on order statistics. If we order all the observations x_i in ascending order and consider only the tail above the k-th order statistic (on this tail the scaled residual lifetimes converge to Generalized Pareto), we can infer \hat{\gamma} from these order statistics as follows.

Theorem 6.3. Pickands’ estimator

The \gamma parameter of Generalized Pareto Distribution can be estimated as: \hat{\gamma} = \log_2 (\frac{ x_{k + 3/4 (n-k)} - x_{k + 1/2 (n-k)} }{ x_{k + 1/2 (n-k)} - x_k }).

Proof:

Step 0 (optional): Representation of Generalized Pareto’s survival function as integral

Generalized Pareto Distribution’s survival function can be represented as S(x)=e0x11+γtdt=e1γ0xdln(1+γt)=e1γ(ln(1+γx))ln1=(1+γx)1/γS(x) = e^{ - \int \limits_{0}^{ x } \frac{1}{1 + \gamma t} dt } = e^{ -\frac{1}{\gamma} \int \limits_{0}^{x}d \ln(1 + \gamma t) } = e^{-\frac{1}{\gamma} (\ln (1 + \gamma x) ) - \cancel{\ln 1} } = (1 + \gamma x)^{-1 /\gamma}.

Step 1: Inverse of Generalized Pareto’s survival function

Let us derive the inverse of the survival function of Generalized Pareto. We know that:

S(S(x))=(1+γS(x))1/γS(S^{\leftarrow}(x)) = (1 + \gamma S^{\leftarrow}(x))^{-1/\gamma}

x=(1+γS(x))1/γx = (1 + \gamma S^{\leftarrow}(x))^{-1/\gamma}

xγ=1+γS(x)x^{-\gamma} = 1 + \gamma S^{\leftarrow}(x)

S(x)=xγ1γS^{\leftarrow}(x) = \frac{x^{-\gamma} - 1}{\gamma}

Step 2: Inferring the Pickands’ estimator

Consider two quantiles: median and quartile: S(1/2)=1/2γ1γS^{\leftarrow}(1/2) = \frac{ {1/2}^{-\gamma} - 1 }{ \gamma } and S(1/4)=1/4γ1γS^{\leftarrow}(1/4) = \frac{ {1/4}^{-\gamma} - 1 }{ \gamma }.

S(1/4)S(1/2)=1/4γ1(1/2γ1)γ=1/4γ1/2γγ=1/2γ1/2γ1γS^{\leftarrow}(1/4) - S^{\leftarrow}(1/2) = \frac{ {1/4}^{-\gamma} - 1 - ({1/2}^{-\gamma} - 1) }{ \gamma } = \frac{ {1/4}^{-\gamma} - {1/2}^{-\gamma} } {\gamma} = {1/2}^{-\gamma} \cdot \frac{ {1/2}^{-\gamma} - 1 }{ \gamma }

Divide this by S(1/2)S^{\leftarrow}(1/2), and we get:

S(1/4)S(1/2)S(1/2)=2γ\frac{ S^{\leftarrow}(1/4) - S^{\leftarrow}(1/2) }{ S^{\leftarrow}(1/2) } = {2}^{\gamma}

γ=log2(S(1/4)S(1/2)S(1/2))\gamma = \log_2 (\frac{ S^{\leftarrow}(1/4) - S^{\leftarrow}(1/2) }{ S^{\leftarrow}(1/2) })

Now we choose k large enough, close enough to n, so that the distribution of the order statistics above x_k is close to Generalized Pareto, and we replace S^{\leftarrow}(1/4) - S^{\leftarrow}(1/2) and S^{\leftarrow}(1/2) with x_{k + 3/4 (n-k)} - x_{k + 1/2 (n-k)} and x_{k + 1/2 (n-k)} - x_k, resulting in:

\hat{\gamma} = \log_2 (\frac{ x_{k + 3/4 (n-k)} - x_{k + 1/2 (n-k)} }{ x_{k + 1/2 (n-k)} - x_k }).

One can prove that Pickands estimator is asymptotically normally distributed, hence, it is possible to calculate its confidence intervals. However, this is really tedious (see this paper, Theorem 2.3) and is beyond the scope of this text.

Here is a python implementation of Pickands estimator and its confidence intervals:

import math

import numpy as np
from scipy.integrate import quad


def Pickands_estimator(data, k):
    """Returns Pickands estimates of gamma (shape parameter) and sigma (scale parameter).

    `data` is assumed to be sorted in descending order (upper order statistics first).
    """
    n = len(data)

    quartile = data[k]                 # ~ X_{(k)}: conditional 25% survival point of the tail
    median = data[2 * k]               # ~ X_{(2k)}: conditional median of the tail
    full = data[max(0, 4 * k - 1)]     # ~ X_{(4k)}: the threshold where the tail starts

    gamma = math.log((quartile - median) / (median - full)) / math.log(2)

    # sigma = (X_{(2k)} - X_{(4k)}) / integral_0^{ln 2} exp(gamma * t) dt
    def integrand(t):
        return np.exp(gamma * t)

    sigma = (median - full) / quad(integrand, 0, math.log(2))[0]

    return gamma, sigma


def get_pickands_ci(genpareto_shape, k, alpha=0.95):
    """
    Calculate confidence intervals for Pickands estimator.

    Parameters:
        genpareto_shape (float): Pickands estimate of genpareto shape parameter.
        k (int): the order statistic used (e.g. we considered n=100000 elements and k=1000).
        alpha (float): confidence level; only the default 0.95 is supported by the
            hard-coded 1.96 multiplier below.

    Returns:
        (2-tuple) Lower and upper confidence interval bounds.
    """
    two_sigma = 1.96  # ~2 standard errors correspond to confidence level alpha=0.95

    # asymptotic standard error of the Pickands estimator
    try:
        pickands_standard_error = genpareto_shape * math.sqrt((2**(2 * genpareto_shape + 1) + 1)) / (2 * (2**(genpareto_shape) - 1) * math.log(2) * math.sqrt(k))
    except ZeroDivisionError:
        pickands_standard_error = np.inf

    pickands_confidence_intervals = (genpareto_shape - two_sigma * pickands_standard_error, genpareto_shape + two_sigma * pickands_standard_error)

    return pickands_confidence_intervals


# `data` is assumed to be a 1D sample sorted in descending order
k = math.floor(len(data) / 4)
genpareto_shape, genpareto_scale = Pickands_estimator(data, k)
pickands_confidence_intervals = get_pickands_ci(genpareto_shape, k)
print(pickands_confidence_intervals)
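To check the estimator on synthetic data, here is a sketch (it assumes the Pickands_estimator and get_pickands_ci functions above; the shape parameter is arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_gamma = 0.4
sample = np.sort(stats.genpareto.rvs(true_gamma, size=100_000, random_state=rng))[::-1]

k = 1000                                  # uses the ~4k upper order statistics
gamma_hat, sigma_hat = Pickands_estimator(sample, k)
print(true_gamma, gamma_hat, get_pickands_ci(gamma_hat, k))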

Other estimators

I’ve described two of the best known estimators for EVD/GPD parameters in detail.

However, I want to briefly mention several alternatives:

  • Maximum likelihood (MLE)
  • Method of moments (MOM)
  • Probability-weighted moments (PWM)
  • L-moments (paper)
  • Principle of maximum entropy (POME) (paper)
  • Refined Pickands estimator (paper)
  • Hill-like estimator for arbitrary tail index values range (paper)

Some of them are well-explained in the Tsourti book, others are described in their respective papers.


7. Summary and examples of practical application

Speaking informally:

  • If a distribution’s survival function has no end point and it decays polynomially (has “fat tails”), the distribution belongs to EVD type II (Frechet).
  • If a distribution’s survival function has a finite end point, and it decays polynomially, approaching that end point, the distribution belongs to EVD type III (Weibull).
  • If a distribution’s survival function decays exponentially fast (has light tails), approaching its end point, which might be either finite or infinite, it belongs to EVD type I (Gumbel).

For instance:

  • Pareto, Cauchy, Student and Fisher distributions have fat tails and no finite endpoint and converge to Type II.
  • Uniform and Beta distributions have a finite end point, which their survival functions approach polynomially, and converge to Type III.
  • Gaussian, Exponential, Gamma and Logistic distributions have light, exponentially decaying tails and converge to Type I.

EVD Type I: Gumbel distribution examples

In case of Type I the end point might be finite or infinite.

Example 7.1. Exponential distribution

As an example of a distribution with an infinite end point x_F, consider the exponential distribution F(x) = 1 - e^{-x} for x > 0. We can show that its maximum converges to Type I by choosing a_n = 1, b_n = \log n, so that we get F^n(x + \log n) = (1 - e^{-(x + \log n)})^n = (1 - \frac{e^{-x}}{n})^n \to \exp(-e^{-x}).
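A quick simulation sketch of this convergence (sample sizes chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(7)
n, trials = 1000, 10_000
maxima = rng.exponential(size=(trials, n)).max(axis=1) - np.log(n)

x = 1.0
print((maxima <= x).mean(), np.exp(-np.exp(-x)))   # empirical cdf vs Gumbel cdf at x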

Example 7.2 Gnedenko’s example

An example of distribution with a finite end point xFx_F is from Gnedenko (1943) work:

F(x)={0,x<01exp(x1x),0x<11,x1F(x) = \begin{cases} 0, x < 0 \\ 1 - \exp(\frac{-x}{1-x}), 0 \le x < 1 \\ 1, x \ge 1 \end{cases}

and an=1(1+logn)2a_n = \frac{1}{(1 + \log n)^2} and bn=logn1+lognb_n = \frac{ \log n }{ 1 + \log n }

We go from y[0,]y \in [0, \infty] to x[0,1]x \in [0, 1], so that y=x1xy = \frac{x}{1-x} and, conversely, yyx=xy - yx = x and x(y)=y1+yx(y) = \frac{y}{1 + y}.

Consider an exponential random variable η\eta: p(ηy)=1ey=1ex1xp(\eta \le y) = 1 - e^{-y} = 1 - e^{-\frac{x}{1-x}}.

Consider p(ηy)=p(ηx1x)=p(ηηxx)=p(η1+ηx)=1ex1xp(\eta \le y) = p(\eta \le \frac{x}{1-x}) = p(\eta - \eta x \le x) = p(\frac{\eta}{1 + \eta} \le x) = 1 - e^{-\frac{x}{1-x}}

Hence, we can denote our random variable of interest ξ=η1+η\xi = \frac{\eta}{1 + \eta} and we are looking for the maximum maxξn\max \xi_n of those variables.

We need to shift y by \ln n in order to obtain 1 - e^{-y}/n, so that (1 - e^{-y}/n)^n \to e^{-e^{-y}}. Hence, we consider p(\eta \le y + \ln n) = 1 - e^{-y - \ln n} = 1 - e^{-y}/n, and, thus, we know that the maximum of the shifted random variable \eta - \eta_0 converges to Gumbel Type I EVD, where \eta_0 = \ln n.

Now, we need to express anξbna_n \xi - b_n through ηη0\eta - \eta_0, knowing that ξ=η1+η\xi = \frac{\eta}{1 + \eta}.

Denote g(y)=y1+yg(y) = \frac{y}{1 + y}. We know that ξ=g(η)\xi = g(\eta).

Consider ξξ0=g(η)g(η0)\xi - \xi_0 = g(\eta) - g(\eta_0). Using Taylor series at η0\eta_0, we have g(η)g(η0)=g(η0)(ηη0)+O(ηη0)2g(\eta) - g(\eta_0) = g'(\eta_0) (\eta - \eta_0) + O(\eta - \eta_0)^2.

Then \xi - \xi_0 \approx \frac{1}{(1 + \ln n)^2} (\eta - \eta_0), so (\xi - \xi_0)(1 + \ln n)^2 \to \eta - \eta_0, where \xi_0 = g(\eta_0) = \frac{\ln n}{1 + \ln n}.

Hence, p\left( (\max \limits_i \xi_i - \frac{\ln n}{1 + \ln n})(1 + \ln n)^2 \le x \right) \to e^{-e^{-x}}.

Example 7.3. Gompertz, shifted Gompertz distribution and mortality

When describing the human survival function, the Gompertz model is often used, which is a special case of the Gumbel distribution.

This model works well up to roughly 80-90 years of age; beyond that it becomes inaccurate, as the actual human survival function has a fatter tail, while in this model the tail is exponential and decays too fast.

However, it is still very useful for predicting excess mortality during epidemics for countries like Russia, where generation born in 1941-1945 is almost non-existent due to World War II, so its mortality needs to be imputed, if we were to predict the country’s losses from Covid.

In Gompertz survival model the hazard rate, a.k.a. force of mortality is assumed r(t)=A+Bct=A+Betlncr(t) = A + B c^t = A + B e^{t \cdot \ln c}.

This leads to cumulative hazard rate: R(t)=At+Blnc(etlnc1)R(t) = At + \frac{B}{\ln c} (e^{t \cdot \ln c} - 1).

Which results in a survival function S(t)=exp(R(t))=exp(AtBlnc(etlnc1))S(t) = exp(-R(t)) = exp(-A t - \frac{B}{\ln c} (e^{t \cdot \ln c} - 1) ).

This is a special kind of shifted, scaled Gumbel distribution, mirrored around y axis.
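A small sketch of this survival function with made-up (not fitted) parameters A, B, c, just to show the shape of the Gompertz-Makeham model:

import numpy as np

A, B, c = 5e-4, 3e-5, 1.1            # hypothetical Makeham and Gompertz parameters
t = np.arange(0, 111)                # ages 0..110

hazard = A + B * c ** t                                  # force of mortality r(t)
S = np.exp(-A * t - B / np.log(c) * (c ** t - 1))        # survival function S(t)
print(S[[40, 60, 80, 100]])          # survival probabilities to ages 40, 60, 80, 100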

Example 7.4. Karlin-Altschul statistics in bioinformatics

Suppose that you’re using your IDE, e.g. PyCharm to perform a fuzzy search of a random string “HomerSimpson” against a codebase. What’s the baseline probability of finding a text, where 10 letters out of 12 exactly match your query?

Turns out, this probability follows a variant of Gumbel distribution, called Karlin-Altschul statistics. This fact was proven theoretically by Karlin and Dembo for Hamming distances between query and database, and Altschul has empirically shown that it holds even for Levenshtein distances with insertions/deletions allowed.

Karlin-Altschul statistics is in the core of BLAST software, used for searching known homologs of biological sequences in databases of already known sequences.

Example 7.5. Gumbel Softmax VAE

Gumbel distribution is sometimes used in neural networks in Gumbel softmax nodes in order to slightly relax categorical features, making it differentiable. E.g. see Gumbel Softmax VAE blog post or a newish paper.

EVD Type II: Frechet distribution

Example 7.6. Pareto distribution, Lindy effect

What’s the chance that the richest of your friends would be a billionaire?

Suppose that you’re an average citizen and that the distribution of wealth in your country is Pareto:

F(x) = 1 - (\frac{x}{\lambda})^{-k}, where the parameter \lambda is conventionally called scale and k > 0 is called shape. Equivalently, it can be written as F(x) = 1 - (\frac{\lambda}{x})^{k}.

This distribution behaves more or less like a “faster hyperbola”; e.g. S(x) = \frac{1}{x^2} is the survival function of a Pareto distribution with \lambda = 1, k = 2.

The maximum of n such variables, normalized by a_n = \lambda n^{1/k}, can be shown to converge to Frechet EVD: F^n(a_n x) = (1 - \frac{x^{-k}}{n})^n \to e^{-x^{-k}}.
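A simulation sketch of this convergence (λ, k and the sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(8)
lam, k, n, trials = 2.0, 3.0, 1000, 10_000

sample = lam * (rng.pareto(k, size=(trials, n)) + 1.0)     # Pareto(scale=lam, shape=k)
maxima = sample.max(axis=1) / (lam * n ** (1 / k))         # normalized maxima

x = 1.5
print((maxima <= x).mean(), np.exp(-x ** (-k)))            # empirical cdf vs Frechet cdf at x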

EVD Type III: Inverse Weibull distribution

Example 7.7. Strength of materials

Seemingly identical pieces of materials break at different strengths due to randomness in the material. Weibull distribution describes the probability of a material to break at a strength tt, representing it as a chain of nn identical pieces.

Suppose that probability of each element of chain to break at a strength tt (hazard rate) is polynomial: h(t)=αtα1h(t) = \alpha \cdot t^{\alpha -1}.

This leads to a polynomial cumulative hazard rate H(t) = \int \limits_0^t h(s) ds = \int \limits_0^t d(s^\alpha) = t^\alpha.

Then the survival function of the whole chain is Weibull: S(t)=etαS(t) = e^{-t^\alpha}.

Here α\alpha is known as Weibull modulus.

A more direct approach to derive this result is the weakest link approach. Say, you have a rod of steel of length l. You split it into a chain of n links of identical length \Delta l = \frac{l}{n}. Assume that the probability that a link breaks, when a force t is applied to it, is p(t) = \frac{ \Delta l }{l} t^\alpha = t^\alpha / n. Then the probability that the rod does not break, when force t is applied to it, is (1 - p(t))^n = (1 - \frac{t^\alpha}{n})^n \xrightarrow{n \to \infty} e^{-t^\alpha}.
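The weakest-link argument is easy to simulate (a sketch with arbitrary α and n): each link's breaking strength has cdf t^α/n (inverse transform: T = (nU)^{1/α}), the chain breaks at its weakest link, and the chain's survival function approaches e^{-t^α}.

import numpy as np

rng = np.random.default_rng(5)
alpha, n, trials = 2.0, 1000, 10_000

# breaking strength of each link: cdf t**alpha / n, sampled via inverse transform
link_strengths = (n * rng.uniform(size=(trials, n))) ** (1 / alpha)
chain_strength = link_strengths.min(axis=1)        # the chain breaks at the weakest link

t = 1.2
print((chain_strength > t).mean(), np.exp(-t ** alpha))    # should be close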

Example 7.8. Fraction of mass, occupied by particles of a certain size in mining

Another interesting case, where the Weibull distribution arises, is the distribution of particle sizes in mining. Rosin and Rammler in 1933 empirically found this distribution in data describing the crushing of coal.

Denote n(m)n(m) the number of particles of mass between mm and m+Δmm + \Delta m.

Denote mm' a bigger mass, so that at each moment of time we observe events of fragmentation of particles of mass mm' resulting in creation of particles of size mm: f(mm)f(m' \to m).

Integrating over all masses mm', we get the number of particles n(m)n(m) of mass mm in the next moment of time:

n(m)=Cmn(m)f(mm)dmn(m) = C \int \limits_{m}^{\infty} n(m') f(m' \to m) dm'

Assume that the probability to get a particle of mass mm from mm' is Pareto:

f(mm)=(mm0)γf(m' \to m) = (\frac{m}{m_0})^{-\gamma}

And set the constant C=1m0C = \frac{1}{m_0}. Substituting this into the formula for n(m)n(m), we get:

n(m)=mm0γmn(m)d(mm0)n(m) = \frac{m}{m_0}^{-\gamma} \int \limits_{m}^{\infty} n(m') d(\frac{m'}{m_0})

And if you differentiate this equation, you see that the solution for n(m) is:

n(m) = \frac{M}{m_0} (\frac{m}{m_0})^{-\gamma} e^{-\frac{ (m/m_0)^{1 - \gamma} }{1 - \gamma}}, where M = \int \limits_0^\infty n(m) dm is the normalization constant (the total number of particles).

This produces a survival function \frac{S(m)}{M} = \frac{\int \limits_m^\infty n(m') dm'}{\int \limits_0^\infty n(m') dm'} = \exp(-\frac{1}{1 - \gamma} (\frac{m}{m_0})^{1 - \gamma})

See this paper by Brown and Wohletz for more details.

Example 7.9. Human longevity

While survival function of human is well-described by Gompertz distribution, related to Type I Gumbel distribution in the middle-to-old age, when it comes to the late age and the long-livers, the tail of survival function becomes fat and behaves like the tail of Weibull distribution.

Suppose that we want to infer the parameters of our Weibull distribution from the data directly, e.g. using non-parametric methods. Say, we want to find out the upper end of the human life span. The data on super-long-livers (e.g. older than 110 y.o.) is very scarce (~5 people out of 300,000 older than 90 y.o.). Hence, if we used those super-long-livers directly, the error in our estimates would be huge, as we have just ~5 data points.

However, as I’ve mentioned in part 5, if we apply the Pickands-Balkema-de Haan theorem, we can extrapolate the data that we observe for 90 y.o. people (of whom we have hundreds of thousands of observations) to derive the parameters of the Generalized Pareto Distribution, which also hold for 110 y.o. De Haan et al. did this for a dataset of Dutch long-livers, resulting in a human life span limit of 117-123 years, no trend in the terminal life span (i.e. the limit does not seem to increase with the progress of medicine) and, obviously, a negative tail index \gamma, corresponding to the Inverse Weibull domain of attraction of the survival function.

Example 7.10. Value at Risk (VaR) in finance

Say, you manage a portfolio of assets in finance. At each moment of time you can consider the change of price of your assets as a random variable. How much value can your assets lose in case the 5%-worst outcome materializes? 2%? 10%?

This value is called Value at Risk (VaR). How to calculate it?

Again, we can use a fully non-parametric method and use the sample quantile of level 95% as an approximation of distribution’s 95% quantile. But the amount of data that we’d have might be very small.

Alternatively, we can use the Pickands-Balkema-de Haan theorem to estimate Value at Risk. Say, we have lots of empirical data and can reliably estimate the quantile u corresponding to a tail probability \alpha (e.g. \alpha = 10%). At the same time, the quantile u + x corresponding to a much smaller tail probability p (e.g. p = 1%) describes a rarer event, and estimating it directly is harder.

Then:

p(ξx+u)p(ξu)=(1+γxσ)1/γ\frac{ p(\xi \ge x + u) }{ p (\xi \ge u) } = (1 + \gamma \frac{x}{\sigma})^{-1/\gamma}

p/α=(1+γxσ)1/γp / \alpha = (1 + \gamma \frac{x}{\sigma})^{-1/\gamma}

pαγ=1(1+γxσ)\frac{p}{\alpha}^\gamma = \frac{1}{(1 + \gamma \frac{x}{\sigma} )}

1+γx/σ=αpγ1 + \gamma x / \sigma = \frac{\alpha}{p}^\gamma

x=σγ(αpγ1)x = \frac{\sigma}{\gamma} (\frac{\alpha}{p}^\gamma - 1)

VaR=u+x=u+σγ(αpγ1)VaR = u + x = u + \frac{ \sigma }{ \gamma } (\frac{\alpha}{p}^\gamma - 1)

See this paper with application of this approach to time series prediction with ARMA-GARCH.
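Here is a sketch of this recipe on synthetic heavy-tailed data (toy Student-t "losses"; the thresholds and sample size are arbitrary): fit a GPD to the excesses over the α-level threshold and plug the estimates into the formula above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
losses = stats.t.rvs(df=3, size=50_000, random_state=rng)   # heavy-tailed toy losses

alpha_tail, p_tail = 0.10, 0.01                  # alpha and p as tail probabilities
u = np.quantile(losses, 1 - alpha_tail)          # reliably estimable quantile
excesses = losses[losses > u] - u

gamma, _, sigma = stats.genpareto.fit(excesses, floc=0)
var_evt = u + sigma / gamma * ((alpha_tail / p_tail) ** gamma - 1)
print(var_evt, np.quantile(losses, 1 - p_tail))  # EVT-based VaR vs the empirical quantile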

8. Concluding remarks

In this post we’ve explored the EVD for the case of maximum of i.i.d. random variables. There are many related problems and extensions, however.

I won’t cover them in detail here, as this post is way too long already. Please, refer to books such as Resnick, Leadbetter or Smith & Weissman for this material. However, I want to mention some of the most famous related problems.

Weibull vs Inverse Weibull vs Frechet distributions

In survival analysis one often runs into Weibull distribution, not Inverse Weibull (Type III EVD) distribution.

Recall that Inverse Weibull’s probability density is non-zero for negative values of xx, and cumulative distribution function achieves maximum as x0x \to 0. For all positive xx cdf is already 1.

The regular Weibull’s probability density is the mirror image of the Inverse Weibull’s density around the y-axis, and its cumulative distribution function is flipped accordingly: F_{Weibull}(x) = 1 - F_{InverseWeibull}(-x). I.e. for the regular Weibull distribution the cdf is:

F(x) = 1 - e^{-(\frac{x}{\lambda})^{k}}.

Hence, the probability density equals: f(x) = F'(x) = \frac{k}{\lambda} \cdot (\frac{x}{\lambda})^{k-1} \cdot e^{-(\frac{x}{\lambda})^{k}}.

import math

import numpy as np
import matplotlib.pyplot as plt


shape = 5  # alpha
scale = 5  # beta

# Generate x values from 0.1 to 20 with a step size of 0.1
x = np.arange(0.1, 20, 0.1)

# Calculate y values
weibull_cdf = 1 - math.e**(-(x/scale) ** shape)
weibull_pdf = (shape / scale) * ((x / scale) ** (shape - 1)) * np.exp(-((x / scale) ** shape))

# Create the figure and axis objects
fig, ax = plt.subplots(figsize=(12,8), dpi=100)

# Plot cdf
ax.plot(x, weibull_cdf, label='cdf')

# Plot pdf
ax.plot(x, weibull_pdf, label='pdf')
                                               
# Set up the legend
ax.legend()

# Set up the labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Plot of Weibull pdf and cdf')

# Display the plot
plt.show()

Weibull distribution plot

Weibull distribution plot with shape=5, scale=5.

Wikipedia somewhat incorrectly claims that the Frechet distribution is the same as the Inverse Weibull distribution. As we’ve seen earlier, the Frechet distribution has -k (a negative number) in the power, while the Inverse Weibull has positive k. Also, the Frechet distribution has non-zero pdf at positive values of x, while the Inverse Weibull has non-zero pdf at negative values of x.

Extreme value of minimum

Distribution of minimum behaves in a similar way to the distribution of maximum with some inversions of EVDs.

Multi-variate case and continuous-time stochastic processes

Instead of discrete number of random variables nn you might want to work with stochastic processes with continuous time tt. Similar results can be derived for them.

Non-i.i.d. random variables

As with the extensions of the Central Limit Theorem, the random variables might not be exactly independent or identically distributed. Extreme value theorems still hold for many of these generalized cases.


References:

This text is mostly based on the contents of textbooks by these three titans:

Sidney I. Resnick Laurens de Haan Malcolm Ross Leadbetter

Boris Burkov

Written by Boris Burkov who lives in Moscow, Russia, loves to take part in development of cutting-edge technologies, reflects on how the world works and admires the giants of the past. You can follow me in Telegram