Take some distribution (e.g., exponential) and any point, e.g., 3.
Let’s calculate the true \(F(x)\) there.
Next, for different \(n\), let’s calculate \(\widehat{F_n}(x)\).
We plot the difference \(\widehat{F_n}(x) - F(x)\). The larger \(n\) is, the closer this difference should be to zero.
Code
```python
def calc_Fn(sample, x):
    """Evaluate the empirical distribution function of `sample` at the point x."""
    return numpy.mean(sample <= x)


numpy.random.seed(42)

N = 5000
n_array = numpy.arange(10, N, 10)
# We took an exponential distribution as an example
X = expon().rvs(N)
x = 3
F_x = expon().cdf(x)

delta = []
for n in n_array:
    # Calculate F_n_x from the first n generated points
    F_n_x = calc_Fn(X[:n], x)
    delta.append(F_n_x - F_x)

pyplot.figure(figsize=(10, 5))
pyplot.title('Distribution $\widehat{F_n}(x) - F(x)$, x is fixed', fontsize=12)
pyplot.plot(n_array, delta, c=red_pink, label='$\widehat{F_n}(x) - F(x)$')
pyplot.plot(n_array, [0] * len(n_array), c=turquoise, linewidth=3, label='0')
pyplot.legend(fontsize=12)
pyplot.xlabel('Sample size', fontsize=12)
pyplot.ylabel('$\delta$', fontsize=12)
pyplot.grid(linewidth=0.2)
pyplot.show()
```
📊 Glivenko-Cantelli
How does \(\widehat{F_n}(x)\) approximate \(F(x)\) at each point x?
Theorem (Glivenko-Cantelli):
Let \(X_1, X_2, \dots\) be a sample of unlimited size from a distribution with distribution function \(F\). Then

\[D_n = \sup_x \left|\widehat{F_n}(x) - F(x)\right| \stackrel{a.s.}{\longrightarrow} 0.\]
Here \(D_n\) denotes the maximum (supremum) difference between \(\widehat{F_n}(x)\) and \(F(x)\) over all possible values of \(x\).
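A quick numerical sketch of the theorem (the grid, seed, and sample sizes here are my own illustrative choices, not from the text): approximate \(D_n\) on a dense grid of points and watch it shrink as \(n\) grows.

```python
import numpy
from scipy.stats import expon

numpy.random.seed(0)
X = expon().rvs(20000)
grid = numpy.linspace(0, 10, 2000)   # a dense grid approximates the sup over x
F_grid = expon().cdf(grid)

def ecdf_on_grid(sample, grid):
    """Empirical CDF of `sample` evaluated at every grid point."""
    sample = numpy.sort(sample)
    return numpy.searchsorted(sample, grid, side='right') / len(sample)

distances = []
for n in [100, 1000, 10000]:
    D_n = numpy.max(numpy.abs(ecdf_on_grid(X[:n], grid) - F_grid))
    distances.append(D_n)
    print(n, D_n)                    # D_n shrinks roughly like 1 / sqrt(n)
```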
📊 Glivenko-Cantelli visualization
Generate a sample.
Compute \(F(x)\) at each point \(x\).
Compute \(\widehat{F_n}(x)\) at each point \(x\).
Plot both curves for different values of \(n\).
Code
```python
numpy.random.seed(4)

N = 1000
n_array = [10, 100]
X = expon().rvs(N)
# This time it's not a single point, it's a number line
x = numpy.linspace(0, 5, 1000)
F_x = expon().cdf(x)

pyplot.figure(figsize=(25, 8))
for ind, n in enumerate(n_array):
    F_n_x = numpy.array([calc_Fn(X[:n], x_dot) for x_dot in x])

    # Draw Dn on the graph
    # Calculate the pointwise distances
    delta = numpy.abs(F_n_x - F_x)
    # Find the index of the maximum distance
    x_ind = numpy.argmax(delta)
    # By the index, get the point x at which the distance is largest
    dn_x_dot = x[x_ind]
    # Find the lower and upper points of the D_n segment
    down_bound = min(F_n_x[x_ind], F_x[x_ind])
    up_bound = max(F_n_x[x_ind], F_x[x_ind])

    pyplot.subplot(1, 2, ind + 1)
    pyplot.title('$\widehat{F_n} - F$, ' + f'n={n}', fontsize=16)
    pyplot.plot(x, F_x, color="orange", label='F(x)')
    pyplot.plot(x, F_n_x, color="blue", label='$\widehat{F_n}(x)$')
    pyplot.vlines(dn_x_dot, down_bound, up_bound, color='maroon',
                  linestyle='-', linewidth=3, label='$D_n$')
    pyplot.legend(fontsize=16)
    pyplot.grid(linewidth=0.2)
    pyplot.xlabel('x', fontsize=16)
    pyplot.ylabel('CDF', fontsize=16)
pyplot.show()
```
The larger \(n\) is, the more accurately \(\widehat{F_n}(x)\) approximates \(F(x)\) on the entire number line.
📊 Step 2: statistic
Kolmogorov test statistic
Let’s replace \(F(x)\) with its estimate, the empirical distribution function. Then the test statistic is

\[D_n = \sup_x \left|\widehat{F_n}(x) - F_0(x)\right|,\]
where \(F_0(x)\) is the hypothesized distribution function.
\(H_0: F = F_0\).
\(H_1: F \neq F_0\).
We need to find the distribution of \(D_n\) under \(H_0\).
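In practice this test is implemented as `scipy.stats.kstest`, which computes \(D_n\) and the corresponding p-value. A small illustrative sketch (the sample, seed, and distributions are my own choices, not from the text):

```python
import numpy
from scipy.stats import expon, norm, kstest

numpy.random.seed(42)
X = expon().rvs(300)

# H0 is true: the sample really comes from Exp(1)
res_true = kstest(X, expon().cdf)
# H0 is false: the same sample tested against N(0, 1)
res_false = kstest(X, norm().cdf)

print(res_true.statistic, res_true.pvalue)
print(res_false.statistic, res_false.pvalue)   # large D, tiny p-value
```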
📊 Step 3: Distribution
Distribution of the Kolmogorov test statistic
Kolmogorov’s theorem
Let \(X_1, \dots, X_n\) be a sample from a continuous distribution. Then under \(H_0\), \(\sqrt{n}D_n \stackrel{d}{\rightarrow} \phi\), where \(\phi\) has the Kolmogorov distribution:

\[P(\phi \leq t) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2t^2}, \quad t > 0.\]
Where \(\stackrel{d}{\rightarrow}\) means convergence in distribution.
That is, as \(n\) grows, \(D_n \rightarrow 0\), and \(\sqrt{n}D_n \rightarrow \phi\).
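The limiting distribution gives the asymptotic critical value: reject \(H_0\) at level \(\alpha\) when \(\sqrt{n}D_n\) exceeds the \((1-\alpha)\)-quantile of the Kolmogorov distribution, available in scipy as `kstwobign` (a short sketch, not from the text):

```python
from scipy.stats import kstwobign

# Reject H0 at level alpha when sqrt(n) * D_n exceeds the
# (1 - alpha)-quantile of the Kolmogorov distribution
alpha = 0.05
crit = kstwobign.ppf(1 - alpha)
print(crit)   # the classic value, approximately 1.358
```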
📊 Step 3: visualization
Let us graphically see how the Kolmogorov statistic is distributed, and how it behaves when n is small. To do this, we:
Run 10,000 experiments; in each, generate a sample of size \(n\) and calculate the statistic \(\sqrt{n}D_n\).
Construct a histogram of the resulting statistics.
Overlay the density of the Kolmogorov distribution.
Repeat the same exercise for different sample sizes.
Code
```python
def get_D_statistic(sample, cdf_func):
    """Calculate the D statistic for the sample.

    Does not work for discrete distributions.

    Parameters:
    - sample: the sample
    - cdf_func: function that computes the cdf of F0
    """
    # Sort the sample to get the intervals [X_i, X_i+1]
    sample = numpy.sort(sample)
    # Calculate the empirical distribution function at the sample points
    F_n_x = numpy.array([calc_Fn(sample, x) for x in sample])
    # Shift by 1 to the right: the value of the ECDF just before each X_i
    F_n_x_shifted = numpy.concatenate([[0], F_n_x[:-1]])
    # Build the array of F(X) values
    F_x = cdf_func(sample)
    # Maximum distance at the sample points themselves ...
    left_interval_max = numpy.max(numpy.abs(F_x - F_n_x))
    # ... and just to the left of them, where the ECDF has not jumped yet
    right_interval_max = numpy.max(numpy.abs(F_x - F_n_x_shifted))
    return max(left_interval_max, right_interval_max)


numpy.random.seed(8)

n_array = [5, 100]
distribution = expon()
exp_size = 10000

pyplot.figure(figsize=(25, 8))
# For different sample sizes, plot the graphs
for ind, n in enumerate(n_array):
    statistics = []
    # Generate exp_size experiments
    for i in range(exp_size):
        X = distribution.rvs(n)
        statistics.append(get_D_statistic(X, distribution.cdf) * numpy.sqrt(n))

    pyplot.subplot(1, 2, ind + 1)
    pyplot.title(f'Visualization of Kolmogorov statistic distribution, n={n}', fontsize=16)
    distplot(statistics, label='$\sqrt{n}D_n$')
    x_array = numpy.linspace(0, numpy.max(statistics), 1000)
    pyplot.plot(x_array, kstwobign.pdf(x_array), label='Kolmogorov distribution')
    pyplot.legend(fontsize=16)
    pyplot.grid(linewidth=0.2)
    pyplot.xlabel('Statistic', fontsize=16)
    pyplot.ylabel('Density', fontsize=16)
pyplot.show()
```
For small \(n\) the distribution of the statistic is slightly shifted relative to the limiting Kolmogorov distribution. This works in our favor: with the asymptotic critical values the test rejects a true \(H_0\) less often than \(\alpha\), i.e. it is conservative.
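This conservatism can be checked by simulation. A hedged sketch (the sample size, seed, and experiment count are my own choices): estimate how often the test with the asymptotic critical value rejects a true \(H_0\) at \(n = 5\).

```python
import numpy
from scipy.stats import expon, kstest, kstwobign

numpy.random.seed(0)
alpha = 0.05
crit = kstwobign.ppf(1 - alpha)      # asymptotic threshold for sqrt(n) * D_n
n, n_experiments = 5, 2000

rejections = 0
for _ in range(n_experiments):
    X = expon().rvs(n)               # H0 is true: the sample is Exp(1)
    D_n = kstest(X, expon().cdf).statistic
    if numpy.sqrt(n) * D_n > crit:
        rejections += 1

rejection_rate = rejections / n_experiments
print(rejection_rate)                # stays below alpha for small n
```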
QQ-plot and Shapiro-Wilk test
QQ-plot
A QQ-plot is a method that allows you to visualize how well a normal distribution describes the data.
QQ-plot visualization:
Let the sample come from \(\mathcal{N}(\mu, \sigma^2)\).
Let us find the quantiles of 2 distributions: \(\mathcal{N}(\mu, \sigma^2)\) and \(\mathcal{N}(0, 1)\). Their quantiles will be \(Q, Q'\), respectively.
Let us put the quantiles of the first distribution on the Y-axis and the quantiles of the second distribution on the X-axis.
Then the rule derived above states that all points \((Q'_{\alpha}, Q_{\alpha})\) lie on the same line.
The slope of this line (the tangent of the angle \(\phi\)) is always the same and equals \(\frac{\sigma'}{\sigma}\).
Code
```python
mu = 5
sigma = 10

sample_distr = norm(loc=mu, scale=sigma)
standart_distr = norm(loc=0, scale=1)

quantiles = numpy.arange(0.001, 1, 0.1)
quantle_xticks = ["$Q_{" + str(round(x, 2)) + "}$" for x in quantiles]
quantle_yticks = ["$Q'_{" + str(round(x, 2)) + "}$" for x in quantiles]

standart_q = standart_distr.ppf(quantiles)
sample_q = sample_distr.ppf(quantiles)

fig = pyplot.figure(figsize=(15, 10))
ax = fig.add_subplot(1, 1, 1)
pyplot.title('QQ-Plot, visualization on ideal data', fontsize=15)
ax.plot(standart_q, sample_q, label="Line $(Q_{\\alpha}, Q'_{\\alpha})$")

# AngleAnnotation is the helper class from the matplotlib annotations gallery
for index in [1, 5]:
    center = (standart_q[index], sample_q[index])
    dot1 = (standart_q[index] + 1, sample_q[index])
    dot2 = (standart_q[index + 1], sample_q[index + 1])
    label = ""
    if index == 1:
        label = "$tan(\phi) = \dfrac{\sigma'}{\sigma}$"
    am1 = AngleAnnotation(center, dot1, dot2, ax=ax, size=250,
                          text=r"$\phi$", text_kw=dict(fontsize=16))
    ax.plot(*zip(*[center, dot1]), color="black", label=label)

pyplot.ylabel("Quantiles $\mathcal{N}(\mu, \sigma^2)$", fontsize=15)
pyplot.xlabel("Quantiles $\mathcal{N}(0, 1)$", fontsize=15)
pyplot.xticks(standart_q, quantle_xticks, fontsize=15)
pyplot.yticks(sample_q, quantle_yticks, fontsize=15)
pyplot.legend(fontsize=15)
pyplot.grid(linewidth=0.2)
pyplot.show()
```
Then the estimate of the \(i/n\)-quantile is the \(i\)-th value in the sorted sample (recall the empirical distribution function \(\widehat{F_n}(x) = \dfrac{1}{n}\sum I(X_i \leq x)\)).
Let’s construct the quantiles of the standard normal distribution.
To do this, we calculate the quantile at each point \(\dfrac{i}{N},\ i \in \{1, \dots, N - 1\}\).
Let’s put the estimates of the quantiles of the first distribution on the Y-axis and the true quantiles of the second distribution on the X-axis.
If the graph resembles a straight line, the data are consistent with a normal distribution.
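The steps above can be sketched directly (the sample, its parameters, and the seed are my own illustrative choices): put the \(i\)-th sorted value against the \(\mathcal{N}(0,1)\) quantile at \(i/N\) and check that the points form a line with slope \(\approx \sigma\) and intercept \(\approx \mu\).

```python
import numpy
from scipy.stats import norm

numpy.random.seed(2)
N = 500
mu, sigma = 5, 10
sample = numpy.sort(norm(loc=mu, scale=sigma).rvs(N))

probs = numpy.arange(1, N) / N                 # the points i/N, i = 1..N-1
theor_q = norm(loc=0, scale=1).ppf(probs)      # X-axis: true N(0, 1) quantiles
empir_q = sample[:N - 1]                       # Y-axis: i-th sorted value

# For normal data the points lie close to a straight line
slope, intercept = numpy.polyfit(theor_q, empir_q, 1)
corr = numpy.corrcoef(theor_q, empir_q)[0, 1]
print(slope, intercept, corr)
```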
Let’s say we want to analyze the load time of our site in seconds. We take the last 20 values (in seconds) from different users in our logs and build a QQ-plot for this sample.