The p-values in the text may be slightly inconsistent because many simulation runs were done after the text was written. However, rerunning the simulations produced no significant changes in the p-values.
Import required libraries.
import pandas as pd
import numpy as np
import random
from scipy import stats
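# Newer SciPy versions removed stats.chisqprob, which older statsmodels summary code still calls;
# the next line restores it via the equivalent chi-squared survival function.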
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
random.seed(42)
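# Note: random.seed only seeds Python's built-in random module. The simulations below use
# np.random.binomial, so np.random.seed(42) would also be needed for exact reproducibility.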
1. Read in the dataset.
a. Take a look at the top few rows.
df = pd.read_csv('./ab_data.csv')
df.head()
b. Use the cell below to find the number of rows in the dataset.
nof_rows = df.shape[0]
nof_rows
c. The number of unique users in the dataset.
df.user_id.nunique()
d. The proportion of users converted.
df.converted.mean()
e. The number of times the new_page and treatment don't line up.
df.groupby(['group', 'landing_page']).count()
# Rows where group and landing_page don't line up (treatment/old_page or control/new_page)
dont_line_up = ((df['group'] == 'treatment') != (df['landing_page'] == 'new_page')).sum()
dont_line_up
f. Do any of the rows have missing values?
df.isnull().sum()
For the rows where treatment does not line up with new_page, or control does not line up with old_page, we cannot be sure whether those users actually received the new or old page. Drop those rows and store the resulting dataframe in df2.
df2 = df[((df['group'] == 'control') & (df['landing_page'] == 'old_page')) |
         ((df['group'] == 'treatment') & (df['landing_page'] == 'new_page'))]
df2.head(10)
# Double Check all of the incorrect rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]
df2.shape
How many unique user_ids are in df2?
df2['user_id'].nunique()
There is one user_id repeated in df2.
df2[df2['user_id'].duplicated()].count()
doubles = df2[df2.user_id.duplicated(keep=False)]  # keep=False marks both rows of the repeated user_id
What is the row information for the repeat user_id?
doubles
df2.shape
d. Remove one of the rows with a duplicate user_id, but keep your dataframe as df2.
df2 = df2.drop_duplicates(subset='user_id', keep='first')
df2[df2['user_id'].duplicated()].count()
What is the probability of an individual converting regardless of the page they receive?
df2['converted'].mean()
df2.head()
Given that an individual was in the control group, what is the probability they converted?
ccr = df2.query('group == "control"')['converted'].mean()
ccr
Given that an individual was in the treatment group, what is the probability they converted?
tcr = df2.query('group == "treatment"')['converted'].mean()
tcr
obs_diff = tcr - ccr
obs_diff
What is the probability that an individual received the new page?
df2_groupby = df2.groupby('landing_page').count()
df2_groupby
# groupby sorts landing_page alphabetically, so iloc[0] is new_page and iloc[1] is old_page
new_page_prob = df2_groupby.user_id.iloc[0] / (df2_groupby.user_id.iloc[0] + df2_groupby.user_id.iloc[1])
new_page_prob
e. Use the results in the previous two portions of this question to suggest if you think there is evidence that one page leads to more conversions? Write your response below.
The conversion rates are very close. The new page may or may not result in a different conversion rate; further study is needed. However, the results so far do not look promising for the new page.
Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each observation was observed.
However, then the hard question is do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time? How long do you run to render a decision that neither page is better than another?
These questions are the difficult parts associated with A/B tests in general.
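As an aside (not required by the project), here is a minimal sketch of such sequential monitoring, assuming df2 from above with its timestamp, group, and converted columns; the variable names are illustrative. It recomputes the one-sided z-test after each additional day of data:
import statsmodels.api as sm
df_seq = df2.sort_values('timestamp').copy()
df_seq['date'] = df_seq['timestamp'].str[:10]  # timestamps are strings like '2017-01-14 ...'
running = []
for cutoff in sorted(df_seq['date'].unique()):
    so_far = df_seq[df_seq['date'] <= cutoff]
    c_old = so_far.query('group == "control" and converted == 1').shape[0]
    c_new = so_far.query('group == "treatment" and converted == 1').shape[0]
    nobs_old = so_far.query('group == "control"').shape[0]
    nobs_new = so_far.query('group == "treatment"').shape[0]
    z, p = sm.stats.proportions_ztest([c_old, c_new], [nobs_old, nobs_new], alternative='smaller')
    running.append((cutoff, z, p))
pd.DataFrame(running, columns=['through_date', 'z_score', 'p_value'])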
1. For now, consider you need to make the decision just based on all the data provided. If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be? You can state your hypotheses in terms of words or in terms of $p_{old}$ and $p_{new}$, which are the converted rates for the old and new pages.
$$H_0: p_{new} - p_{old} \leq 0$$
$$H_1: p_{new} - p_{old} > 0$$
2. Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the converted success rate regardless of page - that is, $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the converted rate in ab_data.csv regardless of the page.
Use a sample size for each page equal to the ones in ab_data.csv.
Build the sampling distribution for the difference in converted rates between the two pages over 10,000 iterations, calculating an estimate from the null each time.
Use the cells below to provide the necessary parts of this simulation. If this doesn't make complete sense right now, don't worry - you are going to work through the problems below to complete this problem. You can use Quiz 5 in the classroom to make sure you are on the right track.
# Exercise states "Assume under the null hypothesis, p_new and p_old both have
# "true" success rates equal to the converted success rate regardless of page -
# that is p_new and p_old are equal.
proportion = df.converted.mean()
proportion
df2.head()
treatment_df = df2.query('group == "treatment"')
treatment_cr = treatment_df.converted.mean()
treatment_cr
control_df = df2.query('group == "control"')
control_cr = control_df.converted.mean()
control_cr
difference = treatment_cr - control_cr
difference
a. What is the convert rate for $p_{new}$ under the null?
p_new = df2['converted'].mean() # Same as Part 1 4a
p_new
b. What is the convert rate for $p_{old}$ under the null?
p_old = df2['converted'].mean() # Same as Part 1 4a
p_old
c. What is $n_{new}$?
n_new = treatment_df.shape[0]
n_new
d. What is $n_{old}$?
n_old = control_df.shape[0]
n_old
e. Simulate $n_{new}$ transactions with a convert rate of $p_{new}$ under the null. Store these $n_{new}$ 1's and 0's in new_page_converted.
new_page_converted = np.random.binomial(1, p_new, n_new)
tsm = new_page_converted.mean()
tsm
f. Simulate $n_{old}$ transactions with a convert rate of $p_{old}$ under the null. Store these $n_{old}$ 1's and 0's in old_page_converted.
old_page_converted = np.random.binomial(1, p_old, n_old)
csm = old_page_converted.mean()
csm
g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f).
# compute simulated difference in conversion rate
sim_diff = tsm - csm
# display simulation difference
sim_diff
h. Simulate 10,000 $p_{new}$ - $p_{old}$ values using this same process similarly to the one you calculated in parts a. through g. above. Store all 10,000 values in p_diffs.
p_diffs = []
for _ in range(10000):
    old_page_converted = np.random.binomial(1, p_old, n_old)
    csm_b = old_page_converted.mean()
    new_page_converted = np.random.binomial(1, p_new, n_new)
    tsm_b = new_page_converted.mean()
    p_diffs.append(tsm_b - csm_b)
# A vectorized numpy alternative that simulates the same sampling distribution without a for loop:
# new_converted_simulation = np.random.binomial(n_new, p_new, 10000)/n_new
# old_converted_simulation = np.random.binomial(n_old, p_old, 10000)/n_old
# p_diffs = new_converted_simulation - old_converted_simulation
df.head()
p_diffs = np.array(p_diffs)
pdm = p_diffs.mean()
pdm
p_diffs.std()
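As a quick sanity check (a sketch, not part of the exercise; analytic_se is an illustrative name), the simulated standard deviation should be close to the analytic standard error of the difference of two independent proportions under the null:
analytic_se = np.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
analytic_se  # should be close to p_diffs.std() above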
i. Plot a histogram of the p_diffs. Does this plot look like what you expected? Use the matching problem in the classroom to assure you fully understand what was computed here.
plt.hist(p_diffs);
This is what I expected: a very symmetrical, normal-looking distribution.
j. What proportion of the p_diffs are greater than the actual difference observed in ab_data.csv?
# Same as Part I 4c minus Part I 4b.
ccr = df2.query('group == "control"')['converted'].mean()
tcr = df2.query('group == "treatment"')['converted'].mean()
obs_diff = tcr - ccr
obs_diff
# compute p value
pvalue = (p_diffs > obs_diff).mean()
pvalue
# plot line for observed statistic
plt.hist(p_diffs, alpha=.5)
plt.axvline(x=obs_diff, color='red');
k. In words, explain what you just computed in part j. What is this value called in scientific studies? What does this value mean in terms of whether or not there is a difference between the new and old pages?
The value computed in part j. is the p-value: the probability of observing a difference in conversion rates at least as extreme as the one in ab_data.csv if the null hypothesis were true. Because this p-value is well above the 0.05 Type I error rate, we fail to reject the null hypothesis; there is no evidence that the new page converts better than the old page.
l. We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. Fill in the below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let n_old and n_new refer to the number of rows associated with the old and new pages, respectively.
df2.head()
import statsmodels.api as sm
convert_old = df2.query('group == "control" and converted == 1').shape[0]
convert_new = df2.query('group == "treatment" and converted == 1').shape[0]
n_old = control_df.shape[0]
n_new = treatment_df.shape[0]
print(convert_old, convert_new, n_old, n_new)
m. Now use stats.proportions_ztest to compute your test statistic and p-value. Here is a helpful link on using the built-in.
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
z_score, p_value
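As a cross-check (an aside, not part of the original walkthrough): with the counts passed in the order [old, new] and alternative='smaller', the alternative being tested is that the old page's conversion rate is smaller than the new page's, and the one-sided p-value is the normal CDF of the z statistic:
stats.norm.cdf(z_score)  # should match p_value above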
n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages? Do they agree with the findings in parts j. and k.?
The z-score measures how many standard errors the observed difference in conversion rates is from zero under the null. The one-sided p-value is well above the 0.05 significance level, so, consistent with parts j. and k., we fail to reject the null hypothesis that the old page converts at least as well as the new page.
1. In this final part, you will see that the result you achieved in the previous A/B test can also be achieved by performing regression.
a. Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?
We would use a logistic regression model. The reason we would use this rather than ordinary least squares (OLS) regression is that the dependent variable is binary (converted or not converted) rather than continuous. We will use the statsmodels Logit method.
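For reference, the model being fit below can be written as a standard logistic regression of the log-odds of conversion on the page indicator:
$$\log\left(\frac{p(\text{converted})}{1 - p(\text{converted})}\right) = \beta_0 + \beta_1 \cdot ab\_page$$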
b. The goal is to use statsmodels to fit the regression model you specified in part a. to see if there is a significant difference in conversion based on which page a customer receives. However, you first need to create a column for the intercept, and create a dummy variable column for which page each user received. Add an intercept column, as well as an ab_page column, which is 1 when an individual receives the treatment and 0 if control.
# Painful to rerun the simulation if df2 gets corrupted, so work on a copy.
# This also avoids hidden in-place modification issues.
df2c = df2.copy()
df2c.head(2)
# ab_page is 1 when the individual received the treatment (new_page) and 0 for control.
df2c['ab_page'] = (df2c['group'] == 'treatment').astype(int)
df2c = df2c.drop(['timestamp', 'group', 'landing_page'], axis=1)
df2c.head()
df2c['intercept'] = 1
c. Use statsmodels to import your regression model. Instantiate the model, and fit the model using the two columns you created in part b. to predict whether or not an individual converts.
log_mod = sm.Logit(df2c['converted'], df2c[['intercept', 'ab_page']])
results = log_mod.fit()
results.summary()
d. Provide the summary of your model below, and use it as necessary to answer the following questions.
# Need to exponentiate in order to interpret them.
np.exp(-1.9888), np.exp(-0.0150)
# Since they are negative, easier to explain as 1/np.exp()
1/np.exp(-1.9888), 1/np.exp(-0.0150)
Each of these exponentiated values is the multiplicative change in the odds of conversion for a one-unit increase in the corresponding variable, holding the other variables constant.
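In equation form (added for clarity), exponentiating the ab_page coefficient gives the odds ratio of conversion for the new page relative to the old page:
$$e^{\beta_1} = \frac{\text{odds}(\text{converted} \mid ab\_page = 1)}{\text{odds}(\text{converted} \mid ab\_page = 0)}$$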
e. What is the p-value associated with ab_page?
It is reported in the regression summary above. As with the tests in Part II, it is above .05, so ab_page is not a significant predictor of conversion. Note that the regression reports a two-tailed p-value, whereas the test in Part II was one-tailed, which is why the values differ.
f. Now, you are considering other things that might influence whether or not an individual converts. Discuss why it is a good idea to consider other factors to add into your regression model. Are there any disadvantages to adding additional terms into your regression model?
Other factors, such as the country a user lives in or the day on which they visited (both explored below), could also influence conversion, and including them may reveal effects the page variable alone does not capture. The disadvantages are added model complexity, possible multicollinearity among predictors, and coefficients that become harder to interpret.
g. Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. You will need to read in the countries.csv dataset and merge together your datasets on the appropriate rows. Here are the docs for joining tables.
Does it appear that country had an impact on conversion? Provide the statistical output as well as a written response to answer this question.
Answer provided after results.summary() below
cdf = pd.read_csv('./countries.csv')
cdf.head(2)
cdf2 = df2c.merge(cdf, on='user_id', how='inner')
cdf2.head(2)
cdf2.isnull().sum()
cdf2.country.unique()
cdf2[['CA', 'UK', 'US']] = pd.get_dummies(cdf2['country'])
cdf2 = cdf2.drop(['country', 'CA'], axis=1)
cdf2.head()
log_mod = sm.Logit(cdf2['converted'], cdf2[['intercept', 'ab_page', 'UK', 'US']])
results = log_mod.fit()
results.summary()
# Need to exponentiate in order to interpret them.
np.exp(-0.0149), np.exp(0.0506), np.exp(0.0408)
# Explain negative, easier to explain as 1/np.exp()
1/np.exp(-0.0149)
As before, each exponentiated value is the multiplicative change in the odds of conversion for a one-unit increase in the corresponding variable.
None of these p-values is significant; they are all above .05. We fail to reject the null hypothesis: the new_page is not significantly better than the old_page.
h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model. Provide the summary results, and your conclusions based on the results.
df_int = cdf2.copy()
df_int.head()
# Create the interaction terms first, then include them in the model.
df_int['UK_ab_page'] = df_int['UK'] * df_int['ab_page']
df_int['US_ab_page'] = df_int['US'] * df_int['ab_page']
log_mod = sm.Logit(df_int['converted'], df_int[['intercept', 'ab_page', 'UK', 'US', 'UK_ab_page', 'US_ab_page']])
results = log_mod.fit()
results.summary()
df_int = df_int.drop('intercept', axis=1)
df_int.head()
df_int.corr(method='spearman')
This question is asking whether there is an interaction between country and page; for example, do people from the UK prefer the new_page while people from the US prefer the old_page? Here are my findings.
df2.head(2)
df_usa = df2.merge(cdf, on='user_id', how='inner')
df_usa = df_usa.query('country == "US"')
df_usa.head(2)
n_old = df_usa.query('group == "control"').shape[0]
n_old
convert_old = df_usa.query('group == "control" and converted == 1').shape[0]
convert_old
n_new = df_usa.query('group == "treatment"').shape[0]
n_new
convert_new = df_usa.query('group == "treatment" and converted == 1').shape[0]
convert_new
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
z_score, p_value
The z-test yields a p-value of .9339, which is greater than the .05 level of significance, so we fail to reject the null hypothesis for the US subset: among US users, the new_page does not convert significantly better than the old_page.
I am Canadian. I want to see this question explicitly answered from a Canadian point of view.
cdf = pd.read_csv('./countries.csv')
cdf.head(2)
cdf2 = df2c.merge(cdf, on='user_id', how='inner')
cdf2.head(2)
cdf2.isnull().sum()
cdf2.country.unique()
cdf2[['CA', 'UK', 'US']] = pd.get_dummies(cdf2['country'])
cdf2 = cdf2.drop(['country', 'US'], axis=1)
cdf2.head()
log_mod = sm.Logit(cdf2['converted'], cdf2[['intercept', 'ab_page', 'CA', 'UK']])
results = log_mod.fit()
results.summary()
# Need to exponentiate in order to interpret them.
np.exp(-0.0149), np.exp(-0.0408), np.exp(0.0099)
# Explain negative, easier to explain as 1/np.exp()
1/np.exp(-0.0149), 1/np.exp(-0.0408)
As before, each exponentiated value is the multiplicative change in the odds of conversion for a one-unit increase in the corresponding variable.
None of these p-values is significant; they are all above .05. We fail to reject the null hypothesis: the new_page is not significantly better than the old_page.
h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model. Provide the summary results, and your conclusions based on the results.
df_int = cdf2.copy()
df_int.head()
# Create the interaction terms first, then include them in the model.
df_int['CA_ab_page'] = df_int['CA'] * df_int['ab_page']
df_int['UK_ab_page'] = df_int['UK'] * df_int['ab_page']
log_mod = sm.Logit(df_int['converted'], df_int[['intercept', 'ab_page', 'CA', 'UK', 'CA_ab_page', 'UK_ab_page']])
results = log_mod.fit()
results.summary()
df_int = df_int.drop('intercept', axis=1)
df_int.head()
df_int.corr(method='spearman')
This question is asking whether there is an interaction between country and page; for example, do people from the UK prefer the new_page while people from the US prefer the old_page? As a further exploration, I also checked whether conversion rates changed over time, day by day. The process is:
df2.head(2)
df2s = df2.sort_values('timestamp')
df2s.iloc[0], df2s.iloc[-1]
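# Timestamps are strings like 'YYYY-MM-DD hh:mm:ss...', so characters 8:10 give the day of the month.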
df2s['day'] = df2s['timestamp'].apply(lambda x: x[8:10])
df2s.head()
df2g = df2s.groupby(['day', 'group']).count().reset_index()
df2g = df2g.drop(['user_id', 'timestamp', 'landing_page'], axis=1)
df2g.head()
locations = df2g.day.unique()
heights_control = df2g.query('group == "control"')['converted'].tolist()
heights_treatment = df2g.query('group == "treatment"')['converted'].tolist()
labels = range(2, 25)
plt.bar(locations, height=heights_control, tick_label=labels, color = 'red', alpha=.25)
plt.bar(locations, height=heights_treatment, tick_label=labels, color='blue', alpha=.25);
Not much to see here.
zp = []
for value in locations:
    convert_old = df2s.query('group == "control" and converted == 1 and day == @value').shape[0]
    convert_new = df2s.query('group == "treatment" and converted == 1 and day == @value').shape[0]
    n_old = df2s.query('group == "control" and day == @value').shape[0]
    n_new = df2s.query('group == "treatment" and day == @value').shape[0]
    z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
    zp.append((value, z_score, p_value))
pd.DataFrame(zp, columns = ['day', 'zscore', 'pvalue']).sort_values('pvalue')
For one brilliant day, Jan 10, 2017, the new_page had a significant p-value of .008, which is below .05. However, no other day exhibited this behaviour, no pattern builds up to or subsides from that day, and no weekday or weekend pattern emerged. We fail to reject the null hypothesis: there are no significant time-related changes in conversion rates in this dataset.
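One further sanity check (a sketch, not part of the original analysis; bonferroni_alpha is an illustrative name): with 23 separate daily tests at a .05 level, roughly one false positive is expected by chance alone, and a Bonferroni-adjusted threshold makes the Jan 10 result unremarkable:
bonferroni_alpha = 0.05 / len(locations)  # locations holds the 23 days
bonferroni_alpha  # the Jan 10 p-value of .008 is above this adjusted threshold, so not significant after correction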
Based on simulations, logistic regression models (Logit), and z-tests, we do not see any significant results. The p-values in all cases but one (January 10, 2017) were not significant. We fail to reject the null hypothesis: the old_page is just as good as, if not better than, the new_page.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Analyze_ab_test_results_notebook.ipynb'])