Probability and Statistics
Kyiv School of Economics
To derive the criteria that follow, we need the normal distribution, because it is the distribution that the sample mean approximately follows.
The normal distribution \(\xi \sim \mathcal{N}(\mu, \sigma^2)\) is a continuous distribution whose density decays exponentially with the squared distance from \(\mu\):
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]
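where \(\mu\) is the mean and \(\sigma^2\) is the variance.
Properties:
1. If \(\xi_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)\) and \(\xi_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)\) are independent, then \(\xi_1 + \xi_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)\).
2. \(a \xi_1 \sim \mathcal{N}(a\mu_1, a^2\sigma_1^2)\) for any constant \(a \neq 0\).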
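The density formula and the two properties can be checked numerically. The sketch below is only an illustration: the helper name `normal_density` and the parameter values are assumptions, not part of the original notes.

```r
# Density written out from the formula above (hypothetical helper name)
normal_density <- function(x, mu, sigma) {
  exp(-0.5 * ((x - mu) / sigma)^2) / (sigma * sqrt(2 * pi))
}

# Compare with R's built-in dnorm on a grid of points
x <- seq(-3, 7, by = 0.5)
max_density_gap <- max(abs(normal_density(x, 2, 1.5) - dnorm(x, mean = 2, sd = 1.5)))

# Check the sum and scaling properties by simulation (arbitrary parameters)
set.seed(42)
xi1 <- rnorm(1e5, mean = 1, sd = 2)    # N(1, 4)
xi2 <- rnorm(1e5, mean = -3, sd = 1)   # N(-3, 1), drawn independently of xi1
sum_stats <- c(mean = mean(xi1 + xi2), var = var(xi1 + xi2))   # theory: -2 and 5
scaled_stats <- c(mean = mean(3 * xi1), var = var(3 * xi1))    # theory: 3 and 36

print(max_density_gap)
print(sum_stats)
print(scaled_stats)
```

The simulated moments only approximate the theoretical ones, so they should be compared with a tolerance rather than for equality.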
Let \(\xi_1, ..., \xi_n\) be independent identically distributed random variables with mean \(\mu\) and finite variance \(\sigma^2\), and let \(\overline \xi\) be their sample mean.
Then \(\sqrt{n}\dfrac{\overline \xi - \mu}{\sqrt{\sigma^2}}\) converges in distribution to \(\mathcal{N}(0, 1)\) as \(n \to \infty\) (this is the central limit theorem).
For large \(n\) we can therefore write, approximately:
\[ \sqrt{n}\dfrac{\overline \xi - \mu}{\sqrt{\sigma^2}} \sim \mathcal{N}(0, 1) \Leftrightarrow \overline \xi \sim \mathcal{N}\left(\mu, \dfrac{\sigma^2}{n} \right) \]
Let’s say we have a D20 die from Dungeons and Dragons. Let’s roll it 1000 times and calculate the sum of the values.
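The experiment can be simulated as follows. This is a minimal sketch, not the original code of the notes: the number of repetitions (2000), the seed and the helper name `roll_sum` are assumptions made for illustration.

```r
set.seed(1)
n_rolls <- 1000        # rolls per experiment, as in the text
n_experiments <- 2000  # assumed number of repetitions

# Sum of n rolls of a fair 20-sided die (hypothetical helper)
roll_sum <- function(n) sum(sample(1:20, size = n, replace = TRUE))
sums <- replicate(n_experiments, roll_sum(n_rolls))

# Mean and variance of a single roll
mu <- mean(1:20)                   # 10.5
sigma2 <- mean(((1:20) - mu)^2)    # 33.25

# If the sum is approximately normal, its mean and standard deviation
# should be close to n * mu and sqrt(n * sigma^2)
cat("simulated mean:", mean(sums), " theoretical:", n_rolls * mu, "\n")
cat("simulated sd:  ", sd(sums), " theoretical:", sqrt(n_rolls * sigma2), "\n")
```

A histogram of `sums` (for example via `hist(sums)`) would be a natural next step to compare against the bell-shaped density.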
Generate \(N\) samples of \(M\) elements each, with every element drawn from:
\[\xi \sim \text{Binom}(20, 0.01)\]
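A possible way to run this experiment is sketched below. The values of \(N\) and \(M\), the seed and the helper names `standardized_means` and `skewness` are assumptions; the notes do not state them here.

```r
set.seed(7)

# Draw N samples of M Binom(size, prob) values and return the standardised
# sample means sqrt(M) * (mean - mu) / sigma (hypothetical helper)
standardized_means <- function(N, M, size = 20, prob = 0.01) {
  mu <- size * prob
  sigma2 <- size * prob * (1 - prob)
  samples <- matrix(rbinom(N * M, size = size, prob = prob), nrow = N)
  sqrt(M) * (rowMeans(samples) - mu) / sqrt(sigma2)
}

# Sample skewness: close to 0 for a normal distribution
skewness <- function(z) mean((z - mean(z))^3) / sd(z)^3

z_small <- standardized_means(N = 5000, M = 5)     # few elements per sample
z_large <- standardized_means(N = 5000, M = 500)   # many elements per sample

cat("skewness, M = 5:  ", skewness(z_small), "\n")
cat("skewness, M = 500:", skewness(z_large), "\n")
```

Because \(\text{Binom}(20, 0.01)\) is strongly skewed, the standardised means look normal only when \(M\) is large; the skewness for the small \(M\) stays visibly away from zero.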
\[\begin{align} \sqrt{n}\dfrac{\overline \xi - \mu}{\sqrt{\sigma^2}} &\sim \mathcal{N}(0, 1) \stackrel{\text{prop. 2}}{\Leftrightarrow}\\ \overline \xi - \mu &\sim \mathcal{N}\left(0, \dfrac{\sigma^2}{n} \right) \Leftrightarrow\\ \dfrac{\sum_{i=1}^{n} \xi_i}{n} &\sim \mathcal{N}\left(\mu, \dfrac{\sigma^2}{n} \right) \Leftrightarrow\\ \sum_{i=1}^{n} \xi_i &\sim \mathcal{N}\left(n \cdot \mu, n \cdot \sigma^2 \right) \end{align}\]
For our example:
Then the p-value will be 0.1002442.
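For reference, the exact binomial p-value can be computed as sketched below. The function name `get_pvalue_by_old_logic` is taken from the code used later in this section, but its body here is an assumption, as are the inputs \(n = 30\), \(\mu_0 = 0.5\), \(t = 19\), which were chosen because they appear to reproduce the quoted value.

```r
# Exact one-sided p-value P(T >= t) for T ~ Binom(n, mu0).
# Assumed implementation: the statistic is discrete, so
# P(T >= t) = 1 - P(T <= t - 1).
get_pvalue_by_old_logic <- function(n, mu0, t) {
  1 - pbinom(t - 1, size = n, prob = mu0)
}

# Assumed inputs for the running example
exact_p_value <- get_pvalue_by_old_logic(n = 30, mu0 = 0.5, t = 19)
cat("Exact p-value:", exact_p_value, "\n")
```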
Let’s calculate the p-value using the normal distribution approximation:
\(P(T(X^n) \geq t) = 1 - P(T(X^n) \leq t)\)
This time we evaluate the distribution function at the point \(t\) rather than at \(t - 1\), as we did before. The approximating distribution is continuous, so \(P(T(X^n) = t) = 0\) and there is no need to subtract 1.
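A minimal sketch of this calculation follows. The function name `get_pvalue_by_normal_approx` matches the one used later in this section, but its body is an assumption based on the derivation \(\sum \xi_i \sim \mathcal{N}(n\mu, n\sigma^2)\) with \(\sigma^2 = \mu_0(1 - \mu_0)\); the inputs \(n = 30\), \(t = 19\) are likewise assumed.

```r
# Normal approximation to the one-sided p-value P(T >= t), where T is the
# number of successes in n Bernoulli(mu0) trials (assumed implementation)
get_pvalue_by_normal_approx <- function(n, mu0, t) {
  mean_sum <- n * mu0                      # mean of the sum under H0
  sd_sum <- sqrt(n * mu0 * (1 - mu0))      # standard deviation of the sum under H0
  1 - pnorm(t, mean = mean_sum, sd = sd_sum)
}

# Assumed inputs for the small-sample example
normal_p_small <- get_pvalue_by_normal_approx(n = 30, mu0 = 0.5, t = 19)
# Inputs used by the Z-test code later in this section
normal_p_large <- get_pvalue_by_normal_approx(n = 3000, mu0 = 0.5, t = 1544)
cat("Small sample:", normal_p_small, "\n")
cat("Large sample:", normal_p_large, "\n")
```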
For the small sample:
p-value obtained from the normal distribution approximation: 0.0720635.
p-value obtained using the (former) exact formula: 0.1002442.
For the large sample (\(n = 3000\), \(t = 1544\)):
p-value obtained from the normal distribution approximation: 0.0540653.
p-value obtained using the (former) exact formula: 0.0560909.
As we can see, the p-value obtained from the normal approximation is close to the exact p-value only when the sample size is sufficiently large: the two differ by about 0.002 for the large sample, but by about 0.028 for the small one.
\(H_0: \mu = \mu_0\ \text{vs.}\ H_1: \mu > \mu_0\)
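The test statistic itself can be read off from the `z_criterion_pvalue` code in this section. As a sketch, assuming \(\overline X\) denotes the sample mean, \(n\) the sample size, \(\sigma^2\) the variance under \(H_0\), \(z\) the observed value of the statistic and \(\Phi\) the standard normal distribution function (this notation is introduced here for illustration):

\[ Z(X^n) = \sqrt{n}\dfrac{\overline X - \mu_0}{\sqrt{\sigma^2}}, \qquad \text{p-value} = 1 - \Phi(z) \]

For a Bernoulli sample, \(\sigma^2 = \mu_0(1 - \mu_0)\), so this p-value coincides with the one from the normal approximation of the sum.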
p-value obtained using the (former) exact formula: 0.1002442.
p-value obtained from the normal distribution approximation: 0.0720635.
Z-test p-value: 0.0720635.
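# One-sided Z-test p-value for H0: mu = mu0 vs. H1: mu > mu0
z_criterion_pvalue <- function(sample_mean, sample_size, mu0, variance) {
  # Standardise the sample mean under H0
  Z_statistic <- sqrt(sample_size) * (sample_mean - mu0) / sqrt(variance)
  # Upper-tail probability of the standard normal distribution
  1 - pnorm(Z_statistic)
}

n <- 3000                      # number of trials
t <- 1544                      # observed number of successes
mu0 <- 0.5                     # success probability under H0
variance <- mu0 * (1 - mu0)    # Bernoulli variance under H0

# get_pvalue_by_old_logic() and get_pvalue_by_normal_approx() are not defined
# in this block; they are assumed to come from the earlier examples
old_p_value <- get_pvalue_by_old_logic(n, mu0, t)
normal_p_value <- get_pvalue_by_normal_approx(n, mu0, t)
z_pvalue <- z_criterion_pvalue(t / n, n, mu0, variance)

cat("p-value obtained using the old, correct formula:", old_p_value, "\n",
    "p-value obtained from the normal distribution:", normal_p_value, "\n",
    "Z-test p-value:", z_pvalue, "\n")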
p-value obtained using the old, correct formula: 0.05609088
p-value obtained from the normal distribution: 0.05406527
Z-test p-value: 0.05406527
Is it possible to refine the results of the \(Z\)-test for a binomial distribution with small sample sizes?
First, let’s visualise the p-value as a function of \(t\) for the criteria described above.
The p-value of the test based on the normal approximation is 1 - pnorm(t), evaluated with the mean and standard deviation of the approximating normal distribution.
The p-value of the binomial test depends on whether \(t\) is an integer, so let’s calculate it in 2 cases:
1. \(t\) is a noninteger number, for example \(t = 19.5\). Since \(P(T(X) = 19.5) = 0\), the p-value is \(P(T(X) \geq 19.5) = 1 - P(T(X) < 19.5) = 1 - P(T(X) \leq 19.5)\). The last probability is the distribution function, so the p-value is one minus the distribution function at \(t\).
2. \(t\) is an integer, for example \(t = 19\). Then the p-value is \(P(T(X) \geq 19) = 1 - P(T(X) < 19) = 1 - P(T(X) \leq 18)\), that is, one minus the distribution function at \(t - 1\).
p-value obtained using the (former) exact formula: 0.5880985.
p-value obtained from the normal distribution: 0.5.
Difference: 0.088.
p-value obtained using the (former) exact formula: 0.0576591.
p-value obtained from the normal distribution: 0.0368191.
Difference: 0.021.
p-value obtained using the (former) exact formula: 0.5281742.
p-value obtained from the normal distribution: 0.5.
Difference: 0.028.
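The comparisons above can be reproduced with a sketch like the following. The helper names `exact_pvalue` and `normal_pvalue` are hypothetical, and the inputs (\(n = 20, t = 10\); \(n = 20, t = 14\); \(n = 200, t = 100\), all with \(\mu_0 = 0.5\)) are assumptions chosen because they appear to match the quoted figures; the notes do not state them at this point.

```r
# Exact binomial p-value P(T >= t) for any real t.
# ceiling(t) - 1 equals t - 1 for integer t and floor(t) for noninteger t,
# which covers both cases discussed above.
exact_pvalue <- function(t, n, mu0) {
  1 - pbinom(ceiling(t) - 1, size = n, prob = mu0)
}

# Normal approximation to the same p-value
normal_pvalue <- function(t, n, mu0) {
  1 - pnorm(t, mean = n * mu0, sd = sqrt(n * mu0 * (1 - mu0)))
}

# p-value(t) on a half-integer grid, ready for plotting (n = 30 is assumed)
t_grid <- seq(0, 30, by = 0.5)
curves <- data.frame(t = t_grid,
                     exact = exact_pvalue(t_grid, 30, 0.5),
                     normal = normal_pvalue(t_grid, 30, 0.5))

# Assumed inputs for the three comparisons
cases <- data.frame(n = c(20, 20, 200), t = c(10, 14, 100))
cases$exact <- exact_pvalue(cases$t, cases$n, 0.5)
cases$normal <- normal_pvalue(cases$t, cases$n, 0.5)
cases$difference <- cases$exact - cases$normal
print(cases)
```

Plotting `curves$exact` as a step function against the smooth `curves$normal` shows where the two disagree: the exact p-value is constant between integers, while the approximation keeps decreasing.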
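In each comparison above, the normal approximation gives a smaller p-value than the exact formula. We can compensate by shifting the approximating distribution function by half a unit (a continuity correction):
\[F_{\text{new}}(x) = F_{\text{old}}(x - 0.5)\]
p-value obtained using the (former) exact formula: 0.1002442.
p-value obtained from the normal distribution: 0.0720635.
p-value obtained from the normal distribution approximation with a correction: 0.1006213.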
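The corrected approximation can be sketched as follows. The function name `get_pvalue_by_normal_approx_corrected` is hypothetical, and the inputs \(n = 30\), \(\mu_0 = 0.5\), \(t = 19\) are assumptions that appear to reproduce the values quoted above.

```r
# Normal approximation with the half-unit shift:
# the normal distribution function is evaluated at t - 0.5 instead of t
get_pvalue_by_normal_approx_corrected <- function(n, mu0, t) {
  1 - pnorm(t - 0.5, mean = n * mu0, sd = sqrt(n * mu0 * (1 - mu0)))
}

n <- 30      # assumed sample size
mu0 <- 0.5   # assumed success probability under H0
t <- 19      # assumed observed number of successes

exact_p <- 1 - pbinom(t - 1, size = n, prob = mu0)
plain_p <- 1 - pnorm(t, mean = n * mu0, sd = sqrt(n * mu0 * (1 - mu0)))
corrected_p <- get_pvalue_by_normal_approx_corrected(n, mu0, t)

cat("exact:", exact_p, " normal:", plain_p, " corrected:", corrected_p, "\n")
```

With the shift, the approximation moves from roughly 0.028 below the exact value to within about 0.0004 of it in this example.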