Applied Statistics
Kyiv School of Economics
You are an analyst at OLX. Recently, a product manager approached you and requested to change the color of the “write a message” button in ads from blue to red.
Question
Will this increase the conversion rate for button clicks?
Conversion rate is defined as the ratio of the number of users who started a chat conversation to the number of users who visited the ad page.
Conversion Rate Data

| User ID | Visits to the ad page | Chats started |
|---|---|---|
| 1 | 5 | 5 |
| 2 | 4 | 1 |
| 3 | 0 | 0 |
| 4 | 6 | 4 |
| Total | 15 | 10 |
\[ \begin{align*} \text{Conversion Rate} &= \frac{\text{Number of Conversations}}{\text{Number of Visits}} \\ &= \frac{10}{15} = \frac{2}{3} \approx 0.67 \end{align*} \]
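The same number can be reproduced directly from the table; a minimal sketch:

```python
# Per-user visits and started chats from the table above
visits = [5, 4, 0, 6]
chats = [5, 1, 0, 4]

conversion_rate = sum(chats) / sum(visits)
print(round(conversion_rate, 2))  # → 0.67
```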
For example: users 1 and 3 will be in the control group, and users 2 and 4 will be in the test group.
Which test should we use?
Let \(M_O^C,\ M_O^T\) denote the per-visit chat indicators (1 if a visit led to a chat, 0 otherwise) in the control and test groups.
Hypothesis:
\[ H_0: \mathbb{E}\overline{M_O^C} = \mathbb{E} \overline{M_O^T}\ vs.\ H_1: \mathbb{E}\overline{M_O^C} < \mathbb{E} \overline{M_O^T} \]
Can we use the t-test here?
What if the entire difference between the groups is driven by differences between users rather than by the treatment itself?
For example, the first user never calls and always writes in the chat, while the others do not.
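The per-user rates in the toy table already show this heterogeneity (user 3 has no visits, so no defined rate):

```python
visits = [5, 4, 0, 6]
chats = [5, 1, 0, 4]

# Per-user conversion rates vary from 0.25 to 1.0
rates = [round(c / v, 2) for c, v in zip(chats, visits) if v > 0]
print(rates)  # → [1.0, 0.25, 0.67]
```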
Monte Carlo simulation:
- `O_array`: number of visits to the ad page by each user in the group, generated from a normal distribution.
- `p_array`: conversion rate for each user, generated from a uniform distribution.
- `sample_size`: total number of visits to the ad page in the group.
- `M_sum`: total number of chats started in the group, calculated as the sum of the products of `O_array` and `p_array`.
- The returned sample contains `M_sum` ones (chats started) and the rest zeros (no chats started).

```python
import numpy
from scipy.stats import norm, uniform, ttest_ind
from statsmodels.stats.proportion import proportion_confint

def generate_incorrect_group_sample(group_size, p=0.3):
    O_array = norm(loc=10, scale=1).rvs(size=group_size).astype(int)
    p_array = uniform().rvs(size=group_size)
    sample_size = numpy.sum(O_array)
    M_sum = numpy.sum((O_array * p_array).astype(int))
    return [1] * M_sum + [0] * (sample_size - M_sum)
```
```python
numpy.random.seed(42)
mc_size = 1000
bad_cnt = 0
alpha = 0.05
group_size = 1000

for _ in range(mc_size):
    test_sample = generate_incorrect_group_sample(group_size)
    control_sample = generate_incorrect_group_sample(group_size)
    pvalue = ttest_ind(test_sample, control_sample, equal_var=True, alternative='two-sided').pvalue
    bad_cnt += (pvalue <= alpha)

print(f"FPR: {round(bad_cnt / mc_size, 4)}")
print(f"CI={proportion_confint(count=bad_cnt, nobs=mc_size, alpha=0.05, method='wilson')}")
```

Output:
```
FPR: 0.263
CI=(0.23665685979107837, 0.2911570237310747)
```
Why?
The t-test assumes that the observations are independent and identically distributed (i.i.d.), which is not true here: all visits by the same user share that user's conversion rate.
In the toy example we have only two users per group, so the t-test is not applicable. By "multiplying" each user into their individual visits, we obtained samples of about 10,000 observations and applied the t-test to them as if the visits were independent.
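A minimal illustration of this effect (not from the lecture code): duplicating observations adds no information, but the t-test sees a larger sample and a smaller standard error, so p-values shrink artificially.

```python
import numpy
from scipy.stats import ttest_ind

rng = numpy.random.default_rng(0)
a = rng.normal(size=20)
b = rng.normal(size=20)  # same distribution as a, so H0 is true

p_small = ttest_ind(a, b).pvalue
# "Multiply" every observation 100 times, as if each copy were a new user
p_tiled = ttest_ind(numpy.tile(a, 100), numpy.tile(b, 100)).pvalue
print(p_small > p_tiled)  # → True: duplication makes the test overconfident
```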
Measure not the conversion from ad open to chat, but the number of chats per user!
Hypothesis:
\[ H_0: \mathbb{E}\overline{M^C} = \mathbb{E} \overline{M^T}\ vs.\ H_1: \mathbb{E}\overline{M^C} < \mathbb{E} \overline{M^T} \]
\(M^C,\ M^T\) — the number of chats started by a user in the control and test groups, respectively.
In this case, we average over users rather than over ad opens, as in the previous case.
We also eliminate the dependencies within the samples: one user opening a chat does not affect another user opening a chat.
- `O_array`: number of visits to the ad page by each user in the group, generated from a normal distribution.
- `p_array`: conversion rate for each user, generated from a uniform distribution.
- `M_array`: number of chats started by each user, calculated as the product of `O_array` and `p_array`.
- The function returns both `O_array` and `M_array` for the group.

```python
def generate_group_sample(group_size):
    O_array = norm(loc=50, scale=5).rvs(size=group_size).astype(int)
    p_array = uniform().rvs(size=group_size)
    M_array = (O_array * p_array).astype(int)
    return {
        "O_array": O_array,
        "message_array": M_array
    }
```
```python
numpy.random.seed(42)
mc_size = 1000
bad_cnt = 0
alpha = 0.05
group_size = 1000

for _ in range(mc_size):
    test_sample = generate_group_sample(group_size)
    control_sample = generate_group_sample(group_size)
    pvalue = ttest_ind(test_sample['message_array'], control_sample['message_array'],
                       equal_var=False, alternative='two-sided').pvalue
    bad_cnt += (pvalue <= alpha)

print(f"FPR: {round(bad_cnt / mc_size, 4)}")
print(f"CI={proportion_confint(count=bad_cnt, nobs=mc_size, alpha=0.05, method='wilson')}")
```

Output:
```
FPR: 0.05
CI=(0.03813026239274881, 0.06531382024425081)
```
What is the maximum number of chats any OLX user can have?
It's unclear. But what if you know the user visited only one ad page? Then you can easily answer the question: 1 chat.
This means the number of chats may strongly correlate with user ad visits.
And we don’t want to lose this information: if we can use it, we can create a more powerful criterion!
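A quick sanity check under the same simulation model as `generate_group_sample` (normal visit counts, uniform per-user conversion): chats and visits are indeed positively correlated across users.

```python
import numpy

rng = numpy.random.default_rng(42)
O_array = rng.normal(loc=50, scale=5, size=10_000).astype(int)  # visits per user
p_array = rng.uniform(size=10_000)                              # per-user conversion
M_array = (O_array * p_array).astype(int)                       # chats per user

corr = numpy.corrcoef(O_array, M_array)[0, 1]
print(corr > 0)  # → True
```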
Let:
\[ \begin{align} &H_0: \mathbb{E} \dfrac{\sum M^T}{\sum O^T} - \mathbb{E} \dfrac{\sum M^C}{\sum O^C} = 0\ vs.\ \\ &H_1: \mathbb{E} \dfrac{\sum M^T}{\sum O^T} - \mathbb{E} \dfrac{\sum M^C}{\sum O^C} > 0 \end{align} \]
Ultimately, we are interested in calculating the statistic
\[ K = \dfrac{\sum M^T}{\sum O^T} - \dfrac{\sum M^C}{\sum O^C} \]
and constructing a confidence interval for it. But how?
In this case, it is essential to remember that in the bootstrap you need to resample the number of chats and the number of visits together, because they belong to the same user!
```python
def ratio_bootstrap(message_array, message_control, n_test, n_control, alpha=0.05):
    theta_func = lambda MT, MC, OT, OC: numpy.sum(MT, axis=1) / numpy.sum(OT, axis=1) - numpy.sum(MC, axis=1) / numpy.sum(OC, axis=1)
    B = 1000
    batch_size = B // 20
    theta_asterisk_array = []
    test_size = len(message_array)
    control_size = len(message_control)
    test_inds_array = numpy.arange(0, test_size)
    control_inds_array = numpy.arange(0, control_size)
    for _ in range(0, B, batch_size):
        # Resample user indices, so each user's chats and visits stay paired
        boot_t_inds = numpy.random.choice(test_inds_array, replace=True, size=(batch_size, test_size))
        boot_c_inds = numpy.random.choice(control_inds_array, replace=True, size=(batch_size, control_size))
        # Wrong way to sample with replacement: resampling each array
        # independently would break the chats-visits pairing within a user
        # boot_MT = numpy.random.choice(message_array, replace=True, size=(batch_size, test_size))
        # boot_MC = numpy.random.choice(message_control, replace=True, size=(batch_size, control_size))
        # boot_OT = numpy.random.choice(n_test, replace=True, size=(batch_size, test_size))
        # boot_OC = numpy.random.choice(n_control, replace=True, size=(batch_size, control_size))
        boot_MT = message_array[boot_t_inds]
        boot_MC = message_control[boot_c_inds]
        boot_OT = n_test[boot_t_inds]
        boot_OC = n_control[boot_c_inds]
        theta_asterisk = theta_func(
            boot_MT, boot_MC, boot_OT, boot_OC
        ).ravel()
        assert len(theta_asterisk) == batch_size
        theta_asterisk_array = numpy.concatenate([theta_asterisk_array, theta_asterisk])
    # Percentile bootstrap confidence interval
    percentile_left_bound, percentile_right_bound = numpy.quantile(theta_asterisk_array, [alpha / 2, 1 - alpha / 2])
    return percentile_left_bound, percentile_right_bound
```
```python
numpy.random.seed(8)
bad_cnt_correct_bootstrap = 0

for _ in range(mc_size):
    test_sample = generate_group_sample(group_size)
    control_sample = generate_group_sample(group_size)
    left_bound_correct, right_bound_correct = ratio_bootstrap(
        test_sample['message_array'],
        control_sample['message_array'],
        test_sample['O_array'],
        control_sample['O_array']
    )
    bad_cnt_correct_bootstrap += ((left_bound_correct > 0) | (right_bound_correct < 0))

corr_fpr = round(bad_cnt_correct_bootstrap / mc_size, 4)
corr_left, corr_right = proportion_confint(count=bad_cnt_correct_bootstrap,
                                           nobs=mc_size, alpha=0.05, method='wilson')
print(f"FPR, correct bootstrap: {round(corr_fpr, 4)},"
      f" [{round(corr_left, 4)}, {round(corr_right, 4)}]")
```

Output:
```
FPR, correct bootstrap: 0.054, [0.0416, 0.0698]
```
To simulate an A/B test, we slightly modify the user-sample generation function: we add the ability to change the conversion rate in the test and control groups.
```python
def generate_group_sample_equal_cr(group_size, p=0.1):
    O_array = norm(loc=50, scale=5).rvs(size=group_size).astype(int)
    p_array = norm(loc=p, scale=p / 10).rvs(group_size)
    p_array = numpy.maximum(p_array, 0)
    M_array = (O_array * p_array).astype(int)
    return {
        "O_array": O_array,
        "message_array": M_array
    }
```
```python
numpy.random.seed(8)
power_ratio_cnt = 0
power_means_cnt = 0

for _ in range(mc_size):
    test_sample = generate_group_sample_equal_cr(group_size, p=0.101)
    control_sample = generate_group_sample_equal_cr(group_size, p=0.1)
    left_bound_ratio, right_bound_ratio = ratio_bootstrap(
        test_sample['message_array'],
        control_sample['message_array'],
        test_sample['O_array'],
        control_sample['O_array']
    )
    pvalue_means = ttest_ind(test_sample['message_array'], control_sample['message_array'],
                             equal_var=False, alternative='two-sided').pvalue
    power_ratio_cnt += (left_bound_ratio > 0)
    power_means_cnt += (pvalue_means < 0.05)

power_ratio = round(power_ratio_cnt / mc_size, 4)
ratio_left, ratio_right = proportion_confint(
    count=power_ratio_cnt, nobs=mc_size, alpha=0.05, method='wilson')
print(f"power, ratio: {round(power_ratio, 4)},"
      f" [{round(ratio_left, 4)}, {round(ratio_right, 4)}]")

power_means = round(power_means_cnt / mc_size, 4)
means_left, means_right = proportion_confint(count=power_means_cnt,
                                             nobs=mc_size, alpha=0.05, method='wilson')
print(f"power, means: {round(power_means, 4)},"
      f" [{round(means_left, 4)}, {round(means_right, 4)}]")
```

Output:
```
power, ratio: 0.494, [0.4631, 0.525]
power, means: 0.328, [0.2996, 0.3577]
```
The main problem with ratio metrics is that you add a denominator to your metric, and this denominator can also change in response to your changes to the website.
In that case, \(H_1: \mathbb{E} \dfrac{\sum M^T}{\sum O^T} - \mathbb{E} \dfrac{\sum M^C}{\sum O^C} > 0\) may be inconsistent with \(H_1: \mathbb{E} M^T - \mathbb{E} M^C > 0\): the ratio can grow simply because the denominator shrank.
Therefore, it is helpful to measure the change in the ratio metric and, separately, check the denominator using a t-test!
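That denominator check is just a plain t-test on per-user visit counts. A sketch with hypothetical data (the group sizes and distributions here are purely illustrative):

```python
import numpy
from scipy.stats import ttest_ind

rng = numpy.random.default_rng(0)
# Hypothetical per-user visit counts (the ratio's denominator) in both groups
O_test = rng.normal(loc=50, scale=5, size=1_000)
O_control = rng.normal(loc=50, scale=5, size=1_000)

# A small p-value here would mean the change also moved the denominator,
# so a shift in the ratio metric could not be attributed to the numerator alone
pvalue_denominator = ttest_ind(O_test, O_control, equal_var=False).pvalue
print(round(pvalue_denominator, 3))
```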
That, in short, is your plan for working with ratio metrics.
Let's imagine you have joined a new company where A/B testing is poorly organized: everyone runs tests however they want and makes many mistakes in the process.
You decide to create a unified system for A/B testing.
For such a system, bootstrapping is not the best solution, as it is slow and requires a lot of memory.
Let's recall what a Taylor series is and how it can be used to approximate a smooth function:
\(f(x) = \sum_{k=0}^{n}\dfrac{f^{(k)}(x_0)}{k!}(x - x_0)^k + o((x - x_0)^{n}), x \rightarrow x_0\).
If we expand the function only up to the first derivative, then:
\(f(x) = f(x_0) + f'(x_0)(x-x_0) + o((x - x_0)) \approx f(x_0) + f'(x_0)(x-x_0), x \rightarrow x_0\)
And if you have a function of two variables, then:
\(f(x, y) \approx f(\theta) + f'_x(\theta)(x-x_0) + f'_y(\theta)(y-y_0), (x, y) \rightarrow \theta, \theta := (x_0, y_0)\).
Now, let \(f(x, y) = \dfrac{x}{y}\). Then
\[\begin{align} \dfrac{x}{y} &\approx \dfrac{x_0}{y_0} + \dfrac{1}{y_0}(x - x_0) - \dfrac{x_0}{y^2_0}(y-y_0) \\ &= \dfrac{x_0}{y_0} + \dfrac{1}{y_0}\left(x - x_0 - \dfrac{x_0}{y_0}y + \dfrac{x_0}{y_0}y_0\right) \\ &= \dfrac{x_0}{y_0} + \dfrac{1}{y_0}\left(x - \dfrac{x_0}{y_0}y \right) \end{align} \]
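A quick numerical check of this first-order approximation near \((x_0, y_0) = (10, 15)\) (the point and the offsets are arbitrary illustration values):

```python
x0, y0 = 10.0, 15.0
x, y = 10.3, 14.8  # a nearby point

exact = x / y
approx = x0 / y0 + (1 / y0) * (x - (x0 / y0) * y)
print(round(exact, 4), round(approx, 4))  # → 0.6959 0.6956
```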
Now let’s recall our ratio metric of interest: it is \(\dfrac{\sum_i^n {X_i}}{\sum_i^n {Y_i}}\), where all values for all users of the site are summed.
Let’s move from the sum to the average: \(\dfrac{\sum_i^n {X_i}}{\sum_i^n {Y_i}} = \dfrac{\overline{X}}{\overline{Y}}\).
What interesting fact do we know about the average? By the Law of Large Numbers, \(\overline{X} \stackrel{p}{\rightarrow} \mathbb{E} X\) and \(\overline{Y} \stackrel{p}{\rightarrow} \mathbb{E} Y\) as the sample grows.
Let’s then expand \(\dfrac{\overline{X}}{\overline{Y}}\) using Taylor’s formula at the point \((\mathbb{E} X, \mathbb{E} Y)\) with respect to the two variables \(\overline{X}, \overline{Y}\).
\[ \dfrac{\overline{X}}{\overline{Y}} \approx \dfrac{\mathbb{E} X}{\mathbb{E} Y} + \dfrac{1}{\mathbb{E} Y}\left(\overline{X} - \dfrac{\mathbb{E} X}{\mathbb{E} Y}\overline{Y} \right) \]
\[\begin{align} \dfrac{\overline{X}}{\overline{Y}} &\approx \dfrac{\mathbb{E} X}{\mathbb{E} Y} + \dfrac{1}{\mathbb{E} Y}\left(\overline{X} - \dfrac{\mathbb{E} X}{\mathbb{E} Y}\overline{Y} \right) \\ &=\dfrac{\mathbb{E} X}{\mathbb{E} Y} + \dfrac{1}{\mathbb{E} Y} \cdot \dfrac{1}{N}\sum_{i}^N \left(X_i - \dfrac{\mathbb{E} X}{\mathbb{E} Y} Y_i \right) \\ &= \dfrac{1}{N}\sum_{i}^N \left(\dfrac{\mathbb{E} X}{\mathbb{E} Y} + \dfrac{1}{\mathbb{E} Y}\left(X_i - \dfrac{\mathbb{E} X}{\mathbb{E} Y} Y_i \right) \right) \end{align} \]
Let us denote by \(Z'_i := \dfrac{\mathbb{E} X}{\mathbb{E} Y} + \dfrac{1}{\mathbb{E} Y}\left(X_i - \dfrac{\mathbb{E} X}{\mathbb{E} Y} Y_i \right)\).
Then
\[\begin{align} &\dfrac{\overline{X}}{\overline{Y}} \approx \overline Z',\\ &\mathbb{E} \dfrac{\overline{X}}{\overline{Y}} \approx \mathbb{E} \overline Z' \end{align} \]
All \(Z'_i\) are independent and identically distributed (i.i.d.) random variables, so we can use the t-test to check the hypothesis about the mean of \(Z'\). But we don't know the values of \(\mathbb{E} X\) and \(\mathbb{E} Y\), so we plug in their sample estimates:
\[ Z_i = \dfrac{\overline X}{\overline Y} + \dfrac{1}{\overline Y}\left(X_i - \dfrac{\overline X}{\overline Y} Y_i \right) \]
Then we can write:
\[ (\overline{X}, \overline{Y}) \stackrel{p}{\rightarrow} (\mathbb{E} X, \mathbb{E} Y) \]
and
\[ \begin{align} Z_i \stackrel{p}{\rightarrow} \dfrac{\mathbb{E} X}{\mathbb{E} Y} + \dfrac{1}{\mathbb{E} Y}\left(X_i - \dfrac{\mathbb{E} X}{\mathbb{E} Y} Y_i \right) = Z'_i \end{align} \]
For large sample sizes, \(\mathbb{E} Z \approx \mathbb{E} Z'\). This means that \(\mathbb{E} Z \approx \mathbb{E} Z' \approx \mathbb{E} \dfrac{\overline{X}}{\overline{Y}}\).
By applying the t-test to the sample \(Z\), we can compare \(\dfrac{\overline{X}}{\overline{Y}}\) between the test and control groups.
```python
def linearisation(numerator, denominator):
    E_num = numpy.mean(numerator)
    E_den = numpy.mean(denominator)
    return E_num / E_den + 1 / E_den * (numerator - E_num / E_den * denominator)
```
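A useful property to verify: by construction, the sample mean of \(Z\) reproduces \(\overline X / \overline Y\) exactly, so linearization changes the inference machinery but not the point estimate. A self-contained sketch (the Poisson data here is purely illustrative):

```python
import numpy

def linearisation(numerator, denominator):
    E_num = numpy.mean(numerator)
    E_den = numpy.mean(denominator)
    return E_num / E_den + 1 / E_den * (numerator - E_num / E_den * denominator)

rng = numpy.random.default_rng(0)
X = rng.poisson(lam=5, size=1_000)   # e.g. chats per user
Y = rng.poisson(lam=50, size=1_000)  # e.g. visits per user

Z = linearisation(X, Y)
print(numpy.isclose(numpy.mean(Z), numpy.sum(X) / numpy.sum(Y)))  # → True
```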
```python
numpy.random.seed(8)
bad_cnt_linearisation = 0
bad_cnt_bootstrap = 0

for _ in range(mc_size):
    test_sample = generate_group_sample_equal_cr(group_size)
    control_sample = generate_group_sample_equal_cr(group_size)
    # Linearize the ratio metric M / O (chats per visit) in each group
    test_Z = linearisation(test_sample['message_array'], test_sample['O_array'])
    control_Z = linearisation(control_sample['message_array'], control_sample['O_array'])
    # Run the bootstrap criterion and build a confidence interval
    left_bound, right_bound = ratio_bootstrap(
        test_sample['message_array'],
        control_sample['message_array'],
        test_sample['O_array'],
        control_sample['O_array']
    )
    pvalue_linearise = ttest_ind(test_Z, control_Z, equal_var=False, alternative='two-sided').pvalue
    bad_cnt_linearisation += (pvalue_linearise < 0.05)
    bad_cnt_bootstrap += ((left_bound > 0) | (right_bound < 0))

bootstrap_fpr = round(bad_cnt_bootstrap / mc_size, 4)
bootstrap_left, bootstrap_right = proportion_confint(count=bad_cnt_bootstrap,
                                                     nobs=mc_size, alpha=0.05, method='wilson')
print(f"FPR, bootstrap: {round(bootstrap_fpr, 4)},"
      f" [{round(bootstrap_left, 4)}, {round(bootstrap_right, 4)}]")

linearise_fpr = round(bad_cnt_linearisation / mc_size, 4)
linearise_left, linearise_right = proportion_confint(count=bad_cnt_linearisation,
                                                     nobs=mc_size, alpha=0.05, method='wilson')
print(f"FPR, linearisation: {round(linearise_fpr, 4)},"
      f" [{round(linearise_left, 4)}, {round(linearise_right, 4)}]")
```

Output:
```
FPR, bootstrap: 0.052, [0.0399, 0.0676]
FPR, linearisation: 0.049, [0.0373, 0.0642]
```
So we can use the linearization method to calculate the ratio metric and apply the t-test to it.
```python
import time

group_size = 10000
test_sample = generate_group_sample_equal_cr(group_size)
control_sample = generate_group_sample_equal_cr(group_size)

start = time.time()
left_bound, right_bound = ratio_bootstrap(
    test_sample['message_array'],
    control_sample['message_array'],
    test_sample['O_array'],
    control_sample['O_array']
)
end = time.time()
print(f"Bootstrap time: {round(end - start, 2)} seconds")

start = time.time()
test_Z = linearisation(test_sample['message_array'], test_sample['O_array'])
control_Z = linearisation(control_sample['message_array'], control_sample['O_array'])
pvalue_linearise = ttest_ind(test_Z, control_Z, equal_var=False, alternative='two-sided').pvalue
end = time.time()
print(f"Linearisation time: {round(end - start, 2)} seconds")

print(f"p-value: {pvalue_linearise}")
if pvalue_linearise < 0.05:
    print("Reject H0")
else:
    print("Do not reject H0")
print(f"Confidence interval: [{left_bound}, {right_bound}]")
```

Output:
```
Bootstrap time: 0.31 seconds
Linearisation time: 0.0 seconds
p-value: 0.9368145845235889
Do not reject H0
Confidence interval: [-0.039138852630568086, 0.038235767330209953]
```
