Z-test

Probability and Statistics

Ihor Miroshnychenko

Kyiv School of Economics

Tests concerning the mean

Why analyze averages?

  1. Revenue growth.
    • If total revenue grows, the average value grows. Therefore, this task can be reformulated as growth of the average check, or ARPU.
  2. Increase in the number of purchases.
  3. Reducing the outflow of users.

Next, to derive all the criteria we need the normal distribution, because this is the distribution that sample means follow.

Normal Distribution

The normal distribution \(\xi \sim \mathcal{N}(\mu, \sigma^2)\) is a continuous distribution whose density decreases exponentially as the distance from \(\mu\) grows.

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

where:

  • \(\mu\) — location parameter: how far the center is shifted from 0.
  • \(\sigma^2\) — scale parameter: how “flat” the distribution graph will be.

Normal distribution

Properties of the normal distribution (independent variables)

  1. \(\xi_1 \sim \mathcal{N}(\mu_1, \sigma_1^2),\ \xi_2 \sim \mathcal{N}(\mu_2, \sigma_2^2) \Rightarrow \\ \xi_1 + \xi_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)\)

  2. \(a \xi_1 \sim \mathcal{N}(a\mu_1, a^2\sigma_1^2)\)

1st property of the normal distribution

  1. Generate a sample from the 1st and 2nd distributions and sum them.
  2. Next, we will plot the empirical density of the distribution and compare it with the true density on the graph.
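These two steps can be sketched in R (the parameters of the two distributions are chosen purely for illustration):

```r
# Illustrative choice: xi1 ~ N(1, 4), xi2 ~ N(-2, 9)
set.seed(42)
n <- 100000
xi1 <- rnorm(n, mean = 1, sd = 2)
xi2 <- rnorm(n, mean = -2, sd = 3)
s <- xi1 + xi2  # by property 1: s ~ N(1 + (-2), 4 + 9) = N(-1, 13)

# Empirical density of the sum vs. the true N(-1, 13) density
plot(density(s), main = "Sum of two independent normal samples")
curve(dnorm(x, mean = -1, sd = sqrt(13)), add = TRUE, col = "red")
```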

1st property of the normal distribution

Central limit theorem

Let \(\xi_1, ..., \xi_n\) be independent identically distributed random variables:

  • \(E [\xi_i] = \mu < \infty\)
  • \(Var[\xi_i] = \sigma^2 < \infty\).

Then \(\sqrt{n}\dfrac{\overline \xi - \mu}{\sqrt{\sigma^2}}\) converges in distribution to \(\mathcal{N}(0, 1)\).

Or we can write it as:

\[ \sqrt{n}\dfrac{\overline \xi - \mu}{\sqrt{\sigma^2}} \sim \mathcal{N}(0, 1) \Leftrightarrow \overline \xi \sim \mathcal{N}\left(\mu, \dfrac{\sigma^2}{n} \right) \]

Central limit theorem and DnD

Let’s say we have a D20 die from Dungeons and Dragons. Let’s roll it 1000 times and calculate the sum of the values.
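A minimal sketch of this experiment in R (the seed is arbitrary):

```r
set.seed(7)
rolls <- sample(1:20, size = 1000, replace = TRUE)  # 1000 rolls of a fair d20
total <- sum(rolls)

# By the CLT, total is approximately N(n * mu, n * sigma^2),
# where mu = 10.5 and sigma^2 = (20^2 - 1) / 12 = 33.25 for a fair d20
total
```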

Visualization of the CLT

Generate \(N\) samples of \(M\) elements in each:

  • For each sample, calculate the normalized average over \(M\) elements.
  • As a result, we get a sample of N elements.
  • By the CLT, it should look like a sample from a normal distribution.

\[\xi \sim \text{Binom}(20, 0.01)\]

  • \(N = 5000\)

😊
  • \(N = 40\)

🤨
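The procedure above can be sketched as follows (the per-sample size \(M\) and the seed are my choices):

```r
set.seed(123)
N <- 5000  # number of samples
M <- 1000  # elements in each sample
mu <- 20 * 0.01              # mean of Binom(20, 0.01)
sigma2 <- 20 * 0.01 * 0.99   # variance of Binom(20, 0.01)

# For each sample, the normalized average of its M elements
z <- replicate(N, {
  x <- rbinom(M, size = 20, prob = 0.01)
  sqrt(M) * (mean(x) - mu) / sqrt(sigma2)
})

# The N normalized averages should look like a sample from N(0, 1)
hist(z, freq = FALSE, breaks = 50)
curve(dnorm(x), add = TRUE, col = "red")
```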

Equivalent wording of the CLT

\[\begin{align} \sqrt{n}\dfrac{\overline \xi - \mu}{\sqrt{\sigma^2}} &\sim \mathcal{N}(0, 1) \stackrel{prop. 2}{\Leftrightarrow}\\ \overline \xi - \mu &\sim \mathcal{N}\left(0, \dfrac{\sigma^2}{n} \right) \Leftrightarrow\\ \dfrac{\underset{i=1}{\overset{n}{\sum}} \xi_i}{n} &\sim \mathcal{N}\left(\mu, \dfrac{\sigma^2}{n} \right) \Leftrightarrow\\ \underset{i=1}{\overset{n}{\sum}} \xi_i &\sim \mathcal{N}\left(n \cdot \mu, n \cdot \sigma^2 \right) \end{align}\]

Fisher’s \(Z\)-test

📈 Problem 🚚: binomial distribution

  • \(T(X^n) = \underset{i=1}{\overset{n}{\sum}} X_i,\ T \overset{H_0}{\sim} \text{Binom} (n, \mu_0)\)
  • Let \(t\) be the realised value of \(T(X^n)\).
  • \(\text{p-value} = P_{H_0}(T(X^n) \geq t) = 1 - P_{H_0}(T(X^n) < t)\)

For our example:

  • \(n = 30\),
  • \(\mu_0 = 0.5\),
  • \(t = 19\).

Then the p-value will be 0.1002442.
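In R this can be computed directly with pbinom; a sketch, using the helper name that the later code chunk relies on:

```r
# Exact binomial p-value: P(T >= t) = 1 - P(T <= t - 1)
get_pvalue_by_old_logic <- function(n, mu0, t) {
  1 - pbinom(t - 1, size = n, prob = mu0)
}

get_pvalue_by_old_logic(n = 30, mu0 = 0.5, t = 19)  # 0.1002442
```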

📈 Problem 🚚: binomial distribution

  • For a sufficiently large sample size, \(\underset{i=1}{\overset{n}{\sum}} X_i \sim \mathcal{N}\left(n \cdot \mu_0, n \cdot \sigma^2 \right)\),
  • \(X_i \overset{H_0}{\sim} \text{Bernoulli} (\mu_0)\)
  • \(\sigma^2 = \mu_0 \cdot (1 - \mu_0)\)
  • \(\text{p-value} = P_{H_0}(T(X^n) \geq t)\).

This time we evaluate the statistic at point \(t\), not at \(t - 1\) as before. Since the normal distribution is continuous, we do not need to subtract 1:

  • in the case of normal distribution: \(P(T(X^n) \geq t) = P(T(X^n) > t) = 1 - P(T(X^n) \leq t)\);
  • in the case of binomial distribution: \(P(T(X^n) \geq t) = 1 - P(T(X^n) \leq t - 1)\).
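The normal-approximation p-value can be sketched the same way (again using the helper name that the later code chunk calls):

```r
# Under H0, sum(X_i) is approximately N(n * mu0, n * mu0 * (1 - mu0))
get_pvalue_by_normal_approx <- function(n, mu0, t) {
  1 - pnorm(t, mean = n * mu0, sd = sqrt(n * mu0 * (1 - mu0)))
}

get_pvalue_by_normal_approx(n = 30, mu0 = 0.5, t = 19)  # 0.0720635
```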

Binomial vs. normal distribution


Let’s calculate the p-value using the normal distribution approximation:

\(P(T(X^n) \geq t) = 1 - P(T(X^n) \leq t)\)



  • \(n = \color{#e64173}{30}\),
  • \(\mu_0 = 0.5\),
  • \(t = 19\).

p-value obtained from the normal distribution approximation: 0.0720635.

p-value obtained using the exact binomial formula: 0.1002442.

  • \(n = \color{#e64173}{3000}\),
  • \(\mu_0 = 0.5\),
  • \(t = 1544\).

p-value obtained from the normal distribution approximation: 0.0540653.

p-value obtained using the exact binomial formula: 0.0560909.

As we can see, the p-value obtained from the normal approximation is very close to the exact binomial p-value, but only for a sufficiently large sample size.

Fisher’s \(Z\)-test

\(H_0: \mu = \mu_0 \text{ vs. } H_1: \mu > \mu_0\)

  • The statistic \(Z(X) = \sqrt{n}\dfrac{\overline X - \mu_0}{\sqrt{\sigma^2}}\)
  • For a sufficiently large sample size, \(Z(X) \overset{H_0}{\sim} \mathcal{N}(0, 1)\) (according to the CLT)
  • One-sided test: \(\left\{Z(X) \geq z_{1 - \alpha} \right\}\)
    • p-value = \(1 - \Phi(z)\), where \(z\) is the realisation of the statistic \(Z(X)\) and \(\Phi\) is the distribution function of \(\mathcal{N}(0, 1)\)
  • Two-sided test: \(\left\{Z(X) \geq z_{1 - \frac{\alpha}{2}} \right\} \bigcup \left\{Z(X) \leq -z_{1 - \frac{\alpha}{2}} \right\}\)
    • p-value = \(2\cdot \min \left[\Phi(z), 1 - \Phi(z) \right]\), where \(z\) is the realisation of the statistic \(Z(X)\)

Fisher’s \(Z\)-test: small sample size

  • \(n = 30\),
  • \(\mu_0 = 0.5\),
  • \(t = 19\).
  • \(\sigma^2 = \mu_0 \cdot (1 - \mu_0)\).

p-value obtained using the exact binomial formula: 0.1002442.

p-value obtained from the normal distribution approximation: 0.0720635.

Z-test p-value: 0.0720635.
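For this small-sample case the Z statistic can be computed by hand; it reproduces the plain normal-approximation p-value, since the Z statistic is just the standardised sum:

```r
n <- 30; t <- 19; mu0 <- 0.5
variance <- mu0 * (1 - mu0)

# Z statistic with the sample mean t / n
z <- sqrt(n) * (t / n - mu0) / sqrt(variance)
1 - pnorm(z)  # 0.0720635
```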

Fisher’s \(Z\)-test: a larger sample

z_criterion_pvalue <- function(sample_mean, sample_size, mu0, variance) {
  Z_statistic <- sqrt(sample_size) * (sample_mean - mu0) / sqrt(variance)
  1 - pnorm(Z_statistic)
}

# Helpers assumed from the earlier slides, matching the formulas above,
# so that this chunk is self-contained
get_pvalue_by_old_logic <- function(n, mu0, t) {
  1 - pbinom(t - 1, size = n, prob = mu0)  # exact binomial p-value
}
get_pvalue_by_normal_approx <- function(n, mu0, t) {
  1 - pnorm(t, mean = n * mu0, sd = sqrt(n * mu0 * (1 - mu0)))
}

n <- 3000
t <- 1544
mu0 <- 0.5
variance <- mu0 * (1 - mu0)

old_p_value <- get_pvalue_by_old_logic(n, mu0, t)
normal_p_value <- get_pvalue_by_normal_approx(n, mu0, t)
z_pvalue <- z_criterion_pvalue(t / n, n, mu0, variance)

cat("p-value obtained using the exact binomial formula:", old_p_value, "\n",
    "p-value obtained from the normal distribution:", normal_p_value, "\n",
    "Z-test p-value:", z_pvalue, "\n")
p-value obtained using the exact binomial formula: 0.05609088 
 p-value obtained from the normal distribution: 0.05406527 
 Z-test p-value: 0.05406527 
  • \(n = 3000\),
  • \(\mu_0 = 0.5\),
  • \(t = 1544\).
  • \(\sigma^2 = \mu_0 \cdot (1 - \mu_0)\).

p-value obtained using the exact binomial formula: 0.0560909.

p-value obtained from the normal distribution approximation: 0.0540653.

Z-test p-value: 0.0540653.

Continuity correction

Is it possible to refine the results of the \(Z\)-test for a binomial distribution with small sample sizes?

  • \(Z\)-test: \(p\)-value = 0.07
  • Exact \(p\)-value = 0.10

First, let’s visualise the p-value(t) function of the criteria described above:

  • the p-value of the test based on the normal approximation
    • the formula is simple: 1 - pnorm(t) with mean \(n\mu_0\) and standard deviation \(\sqrt{n\mu_0(1-\mu_0)}\)
  • the p-value of the binomial test. Let’s calculate it in 2 cases:
    • \(t\) is a non-integer number. Let’s look at an example:
      • Let \(t = 19.5\). The p-value \(= P(T(X) \geq t) = P(T(X) \geq 19.5) = 1 - P(T(X) < 19.5) = 1 - P(T(X) \leq 19.5)\), since \(P(T(X) = 19.5) = 0\) for a non-integer point. The last probability is the distribution function, so it can be computed as 1 - pbinom(t, n, mu0).
    • \(t\) is an integer.
      • Let \(t = 19\). The p-value \(= P(T(X) \geq t) = P(T(X) \geq 19) = 1 - P(T(X) < 19) = 1 - P(T(X) \leq 18)\).
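A sketch of this visualisation (the grid of \(t\) values and the plotting choices are mine):

```r
n <- 30; mu0 <- 0.5
t_grid <- seq(10, 25, by = 0.5)

# Normal-approximation p-value as a function of t
p_norm <- 1 - pnorm(t_grid, mean = n * mu0, sd = sqrt(n * mu0 * (1 - mu0)))

# Binomial p-value: 1 - F(t - 1) for integer t, 1 - F(floor(t)) otherwise;
# both cases equal 1 - pbinom(ceiling(t) - 1, n, mu0)
p_binom <- 1 - pbinom(ceiling(t_grid) - 1, size = n, prob = mu0)

plot(t_grid, p_binom, type = "s", xlab = "t", ylab = "p-value(t)")
lines(t_grid, p_norm, col = "red")
```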

Continuity correction

  • \(p_{\text{binom}} > p_{\text{norm}}\)
  • As the sample size increases, these values coincide.

Continuity correction

  • \(n = 20\),
  • \(\mu_0 = 0.5\),
  • \(t = 10\).

p-value obtained using the exact binomial formula: 0.5880985.

p-value obtained from the normal distribution: 0.5.

Difference: 0.088.

  • \(n = 20\),
  • \(\mu_0 = 0.5\),
  • \(t = 14\).

p-value obtained using the exact binomial formula: 0.0576591.

p-value obtained from the normal distribution: 0.0368191.

Difference: 0.021.

  • \(n = 200\),
  • \(\mu_0 = 0.5\),
  • \(t = 100\).

p-value obtained using the exact binomial formula: 0.5281742.

p-value obtained from the normal distribution: 0.5.

Difference: 0.028.

Continuity correction

\[F_{\text{new}}(x) = F_{\text{old}}(x - 0.5)\]

Continuity correction


  • \(n = 30\),
  • \(\mu_0 = 0.5\),
  • \(t = 19\).

p-value obtained using the exact binomial formula: 0.1002442.

p-value obtained from the normal distribution: 0.0720635.

p-value obtained from the normal distribution approximation with a correction: 0.1006213.
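The corrected value above can be reproduced by evaluating the normal CDF at \(t - 0.5\) (a minimal sketch):

```r
n <- 30; mu0 <- 0.5; t <- 19
m <- n * mu0
s <- sqrt(n * mu0 * (1 - mu0))

# Without correction: P(T >= t) is approximated by 1 - F(t)
p_plain <- 1 - pnorm(t, mean = m, sd = s)

# With continuity correction F_new(x) = F_old(x - 0.5):
# P(T >= t) is approximated by 1 - F(t - 0.5)
p_corrected <- 1 - pnorm(t - 0.5, mean = m, sd = s)

c(p_plain, p_corrected)  # 0.0720635 and 0.1006213, as on the slide
```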