Analyze A/B Test Results

Table of Contents

  • Introduction
  • Part I - Probability
  • Part II - A/B Test
  • Part III - A regression approach
  • Conclusions

Introduction

The p_values quoted in the text may be slightly inconsistent with the cell outputs, because many simulation runs were done after the text was written. However, rerunning the simulations did not produce any significant changes in the p_values.

Part I - Probability

Import required libraries.

In [1]:
import pandas as pd
import numpy as np
import random
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
random.seed(42)
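
One note on reproducibility: the simulations later in this notebook use numpy's random functions (np.random.binomial), which are not affected by random.seed above. If exact reproducibility of the simulated p_values is wanted, numpy's generator would need to be seeded as well; a minimal sketch (the seed value is arbitrary):

np.random.seed(42)  # seeds numpy's global RNG, which np.random.binomial draws from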

1. Read in the ab_data.csv dataset and store it in df. a. Take a look at the top few rows.

In [2]:
df = pd.read_csv('./ab_data.csv')
df.head()
Out[2]:
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
2 661590 2017-01-11 16:55:06.154213 treatment new_page 0
3 853541 2017-01-08 18:28:03.143765 treatment new_page 0
4 864975 2017-01-21 01:52:26.210827 control old_page 1

b. Use the below cell to find the number of rows in the dataset.

In [3]:
nof_rows = df.shape[0]
nof_rows
Out[3]:
294478

c. The number of unique users in the dataset.

In [4]:
df.user_id.nunique()
Out[4]:
290584

d. The proportion of users converted.

In [5]:
df.converted.mean()
Out[5]:
0.11965919355605512

e. The number of times the new_page and treatment don't line up.

In [6]:
df.groupby(['group', 'landing_page']).count()
Out[6]:
user_id timestamp converted
group landing_page
control new_page 1928 1928 1928
old_page 145274 145274 145274
treatment new_page 145311 145311 145311
old_page 1965 1965 1965
In [7]:
dont_line_up = 1928 + 1965
dont_line_up
Out[7]:
3893
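
As a cross-check, the same count can be computed directly with a boolean mask instead of reading the two numbers off the groupby table; a small sketch using the df loaded above:

# rows where the group label and the landing page do not line up
mismatch = (df['group'] == 'treatment') != (df['landing_page'] == 'new_page')
mismatch.sum()  # 3893, matching 1928 + 1965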

f. Do any of the rows have missing values?

In [8]:
df.isnull().sum()
Out[8]:
user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

For the rows where treatment is not aligned with new_page, or control is not aligned with old_page, we cannot be sure whether those users truly received the new or old page. Drop these rows and store the resulting dataframe in df2.

In [9]:
df2 = df[df[['group', 'landing_page']].apply(lambda x: (x[0] == 'control' and x[1] == 'old_page') or (x[0] == 'treatment' and x[1] == 'new_page'), axis=1)]
df2.head(10)
Out[9]:
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
2 661590 2017-01-11 16:55:06.154213 treatment new_page 0
3 853541 2017-01-08 18:28:03.143765 treatment new_page 0
4 864975 2017-01-21 01:52:26.210827 control old_page 1
5 936923 2017-01-10 15:20:49.083499 control old_page 0
6 679687 2017-01-19 03:26:46.940749 treatment new_page 1
7 719014 2017-01-17 01:48:29.539573 control old_page 0
8 817355 2017-01-04 17:58:08.979471 treatment new_page 1
9 839785 2017-01-15 18:11:06.610965 treatment new_page 1
In [10]:
# Double Check all of the incorrect rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]
Out[10]:
0
In [11]:
df2.shape
Out[11]:
(290585, 5)
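
The apply/lambda filter above works, but a vectorized boolean mask is the more idiomatic (and faster) way to do the same thing; a sketch that should produce the same 290585 rows:

aligned = ((df['group'] == 'control') & (df['landing_page'] == 'old_page')) | \
          ((df['group'] == 'treatment') & (df['landing_page'] == 'new_page'))
df2_alt = df[aligned]
df2_alt.shape  # (290585, 5), the same as df2 above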

How many unique user_ids are in df2?

In [12]:
df2['user_id'].nunique()
Out[12]:
290584

There is one user_id repeated in df2.

In [13]:
df2[df2['user_id'].duplicated()].count()
Out[13]:
user_id         1
timestamp       1
group           1
landing_page    1
converted       1
dtype: int64
In [14]:
doubles = df2[df2.user_id.duplicated()]

What is the row information for the repeat user_id?

In [15]:
doubles
Out[15]:
user_id timestamp group landing_page converted
2893 773192 2017-01-14 02:55:59.590927 treatment new_page 0
In [16]:
df2.shape
Out[16]:
(290585, 5)

d. Remove one of the rows with a duplicate user_id, but keep your dataframe as df2.

In [17]:
df2 = df2.drop_duplicates(subset='user_id', keep='first')
In [18]:
df2[df2['user_id'].duplicated()].count()
Out[18]:
user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

What is the probability of an individual converting regardless of the page they receive?

In [19]:
df2['converted'].mean()
Out[19]:
0.11959708724499628
In [20]:
df2.head()
Out[20]:
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
2 661590 2017-01-11 16:55:06.154213 treatment new_page 0
3 853541 2017-01-08 18:28:03.143765 treatment new_page 0
4 864975 2017-01-21 01:52:26.210827 control old_page 1

Given that an individual was in the control group, what is the probability they converted?

In [21]:
ccr = df2.query('group == "control"')['converted'].mean()
ccr
Out[21]:
0.1203863045004612

Given that an individual was in the treatment group, what is the probability they converted?

In [22]:
tcr = df2.query('group == "treatment"')['converted'].mean()
tcr
Out[22]:
0.11880806551510564
In [23]:
obs_diff = tcr - ccr
obs_diff
Out[23]:
-0.0015782389853555567

What is the probability that an individual received the new page?

In [24]:
df2_groupby = df2.groupby('landing_page').count()
df2_groupby
Out[24]:
user_id timestamp group converted
landing_page
new_page 145310 145310 145310 145310
old_page 145274 145274 145274 145274
In [25]:
new_page_prob = df2_groupby.user_id.iloc[0]/(df2_groupby.user_id.iloc[0] + df2_groupby.user_id.iloc[1])
new_page_prob
Out[25]:
0.5000619442226688
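
The same probability can be read straight from value_counts with normalize=True; a one-line sketch:

df2['landing_page'].value_counts(normalize=True)  # new_page ≈ 0.50006, old_page ≈ 0.49994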

e. Use the results from the previous two portions of this question to suggest whether you think there is evidence that one page leads to more conversions. Write your response below.


  • The overall conversion rate is 0.11965919355605512 before cleaning the data.
  • After cleaning the data, the overall conversion rate is 0.11959708724499628.
  • The conversion rate in the control group (old page) is 0.1203863045004612.
  • The conversion rate in the treatment group (new page) is 0.11880806551510564.

These rates are all very close. The new page may or may not result in a different conversion rate; further study is needed. However, it does not look promising for the new page so far.

Part II - A/B Test

Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each observation was observed.

However, then the hard question is do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time? How long do you run to render a decision that neither page is better than another?

These questions are the difficult parts associated with A/B tests in general.

1. For now, consider you need to make the decision just based on all the data provided. If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be? You can state your hypothesis in terms of words or in terms of $p_{old}$ and $p_{new}$, which are the converted rates for the old and new pages.

$$H_0: p_{new} - p_{old} \leq 0$$

$$H_1: p_{new} - p_{old} > 0$$

2. Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the converted success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the converted rate in ab_data.csv regardless of the page.

Use a sample size for each page equal to the ones in ab_data.csv.

Perform the sampling distribution for the difference in converted between the two pages over 10,000 iterations of calculating an estimate from the null.

Use the cells below to provide the necessary parts of this simulation. If this doesn't make complete sense right now, don't worry - you are going to work through the problems below to complete this problem. You can use Quiz 5 in the classroom to make sure you are on the right track.

In [26]:
# Exercise states "Assume under the null hypothesis,  p_new and p_old both have
# "true" success rates equal to the converted success rate regardless of page - 
# that is  p_new  and  p_old  are equal.
proportion = df.converted.mean()
proportion
Out[26]:
0.11965919355605512
In [27]:
df2.head()
Out[27]:
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
2 661590 2017-01-11 16:55:06.154213 treatment new_page 0
3 853541 2017-01-08 18:28:03.143765 treatment new_page 0
4 864975 2017-01-21 01:52:26.210827 control old_page 1
In [28]:
treatment_df = df2.query('group == "treatment"')
treatment_cr = treatment_df.converted.mean()
treatment_cr
Out[28]:
0.11880806551510564
In [29]:
control_df = df2.query('group == "control"')
control_cr = control_df.converted.mean()
control_cr
Out[29]:
0.1203863045004612
In [30]:
difference = treatment_cr - control_cr
difference
Out[30]:
-0.0015782389853555567

a. What is the convert rate for $p_{new}$ under the null?

In [31]:
p_new = df2['converted'].mean() # Same as Part 1 4a
p_new
Out[31]:
0.11959708724499628

b. What is the convert rate for $p_{old}$ under the null?

In [32]:
p_old = df2['converted'].mean() # Same as Part 1 4a
p_old
Out[32]:
0.11959708724499628

c. What is $n_{new}$?

In [33]:
n_new = treatment_df.shape[0]
n_new
Out[33]:
145310

d. What is $n_{old}$?

In [34]:
n_old = control_df.shape[0]
n_old
Out[34]:
145274

e. Simulate $n_{new}$ transactions with a convert rate of $p_{new}$ under the null. Store these $n_{new}$ 1's and 0's in new_page_converted.

In [35]:
new_page_converted = np.random.binomial(1, p_new, n_new)
tsm = new_page_converted.mean()
tsm
Out[35]:
0.12058357993255799

f. Simulate $n_{old}$ transactions with a convert rate of $p_{old}$ under the null. Store these $n_{old}$ 1's and 0's in old_page_converted.

In [36]:
old_page_converted = np.random.binomial(1, p_old, n_old)
csm = old_page_converted.mean()
csm
Out[36]:
0.11826617288709611

g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f).

In [37]:
# compute simulated difference in conversion rate
sim_diff = tsm - csm

# display simulation difference
sim_diff
Out[37]:
0.002317407045461878

h. Simulate 10,000 $p_{new}$ - $p_{old}$ values using this same process similarly to the one you calculated in parts a. through g. above. Store all 10,000 values in p_diffs.

In [38]:
p_diffs = []
for _ in range(10000):
    old_page_converted = np.random.binomial(1, p_old, n_old)
    csm_b = old_page_converted.mean()
    new_page_converted = np.random.binomial(1, p_new, n_new)
    tsm_b = new_page_converted.mean()
    p_diffs.append(tsm_b - csm_b)
In [39]:
# A vectorized numpy alternative to the loop above; it simulates the number of
# conversions per page directly instead of individual rows, so it does not need a
# for loop and yields an equivalent sampling distribution (up to simulation noise).
# new_converted_simulation = np.random.binomial(n_new, p_new, 10000)/n_new
# old_converted_simulation = np.random.binomial(n_old, p_old, 10000)/n_old
# p_diffs = new_converted_simulation - old_converted_simulation
In [40]:
df.head()
Out[40]:
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
2 661590 2017-01-11 16:55:06.154213 treatment new_page 0
3 853541 2017-01-08 18:28:03.143765 treatment new_page 0
4 864975 2017-01-21 01:52:26.210827 control old_page 1
In [41]:
p_diffs = np.array(p_diffs)
pdm = p_diffs.mean()
pdm
Out[41]:
1.655761946158373e-05
In [42]:
p_diffs.std()
Out[42]:
0.0011957193070001278
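
As a sanity check on the simulation, the analytical standard error of the difference in proportions under the null is close to the simulated standard deviation above; a sketch using p_new, p_old, n_new, and n_old from the earlier cells:

# standard error of (p_new_hat - p_old_hat) when both pages convert at the pooled rate
se_null = np.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
se_null  # ≈ 0.0012, in line with p_diffs.std()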

i. Plot a histogram of the p_diffs. Does this plot look like what you expected? Use the matching problem in the classroom to assure you fully understand what was computed here.

In [43]:
plt.hist(p_diffs);

This is what I expected: a symmetrical, normal-looking distribution.

j. What proportion of the p_diffs are greater than the actual difference observed in ab_data.csv?

In [44]:
# Same as Part I 4c minus Part I 4b.
ccr = df2.query('group == "control"')['converted'].mean()
tcr = df2.query('group == "treatment"')['converted'].mean()
obs_diff = tcr - ccr
obs_diff
Out[44]:
-0.0015782389853555567
In [45]:
# compute p value
pvalue = (p_diffs > obs_diff).mean()
pvalue
Out[45]:
0.9105
In [46]:
# plot line for observed statistic
plt.hist(p_diffs, alpha=.5)
plt.axvline(x=obs_diff, color='red');

k. In words, explain what you just computed in part j. What is this value called in scientific studies? What does this value mean in terms of whether or not there is a difference between the new and old pages?


  • In part j we calculated the proportion of the p_diffs array (the simulated differences in conversion rate under the null) that is greater than the observed difference, i.e. the actual difference in the dataset between the treatment and control conversion rates. This is NOT the single simulated difference calculated above (sim_diff). The resulting p_value is about 0.91 (0.9105 in the output above; an earlier run gave 0.9038).
  • In scientific studies this value is called the p-value. Per Investopedia (https://www.investopedia.com/terms/p/p-value.asp): "The p-value is the level of marginal significance within a statistical hypothesis test representing the probability of the occurrence of a given event. The p-value is used as an alternative to rejection points to provide the smallest level of significance at which the null hypothesis would be rejected."
  • A p_value of about 0.91 means that, if both pages truly converted at the same rate, roughly 91% of the simulated differences would be at least as large as the difference we actually observed; equivalently, the observed difference is larger than only about 9% of the simulated differences under the null. This is far above the 0.05 significance level, so the result is not significant and we fail to reject the null hypothesis.

l. We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. Fill in the below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let n_old and n_new refer to the number of rows associated with the old page and new page, respectively.

In [47]:
df2.head()
Out[47]:
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
2 661590 2017-01-11 16:55:06.154213 treatment new_page 0
3 853541 2017-01-08 18:28:03.143765 treatment new_page 0
4 864975 2017-01-21 01:52:26.210827 control old_page 1
In [48]:
import statsmodels.api as sm

convert_old = df2.query('group == "control" and converted == 1').shape[0]
convert_new = df2.query('group == "treatment" and converted == 1').shape[0]
n_old = control_df.shape[0]
n_new = treatment_df.shape[0]
print(convert_old, convert_new, n_old, n_new)
17489 17264 145274 145310

m. Now use stats.proportions_ztest to compute your test statistic and p-value.

In [49]:
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
z_score, p_value
Out[49]:
(1.3109241984234394, 0.9050583127590245)
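
A quick way to see how the z-score and p-value relate for this one-sided test: with alternative='smaller', the reported p-value is the standard normal CDF evaluated at the z statistic. A sketch using z_score from the cell above:

from scipy.stats import norm
norm.cdf(z_score)  # ≈ 0.905, matching the p_value above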

n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages? Do they agree with the findings in parts j. and k.?


  • A z_score measures how many standard errors the observed statistic is from its expected value under the null. The observed z_score of about 1.31 is nowhere near the rejection region for this one-tailed test (roughly z < -1.645 would be needed to conclude the new page is better), so the observed difference sits comfortably inside the null distribution.
  • 0.9051 > 0.05. With a p_value of 0.9051 we fail to reject the null hypothesis; that is, the old_page converts at least as well as the new_page.
  • Yes, the p_values are virtually identical: 0.9051 from the ztest versus about 0.91 from the simulation. The ztest agrees with the findings in Part II j and k.

Part III - A regression approach

1. In this final part, you will see that the result you achieved in the previous A/B test can also be achieved by performing regression.

a. Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?


We would use a logistic regression model rather than ordinary least squares (OLS) regression because the dependent variable, converted, is categorical (binary) rather than continuous. We will use the statsmodels Logit method.

b. The goal is to use statsmodels to fit the regression model you specified in part a. to see if there is a significant difference in conversion based on which page a customer receives. However, you first need to create a column for the intercept, and create a dummy variable column for which page each user received. Add an intercept column, as well as an ab_page column, which is 1 when an individual receives the treatment and 0 if control.

In [50]:
# Painful to rerun simulation if I screw up df2. Make a copy. Also avoids hidden 
# memory change (or not change) issues.
df2c = df2.copy()
df2c.head(2)
Out[50]:
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
In [51]:
df2c[['group_1', 'group_2' ]] = pd.get_dummies(df2c.group)
df2c[['lp1', 'lp2' ]] = pd.get_dummies(df2c.landing_page)
df2c = df2c.rename(columns={'group_2': 'ab_page', 'lp1': 'treatment'})
df2c = df2c.drop(['timestamp','group',  'landing_page', 'lp2', 'group_1', 'treatment'], axis=1)
In [52]:
df2c.head()
Out[52]:
user_id converted ab_page
0 851104 0 0
1 804228 0 0
2 661590 0 1
3 853541 0 1
4 864975 1 0
In [53]:
df2c['intercept'] = 1
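
The get_dummies/rename/drop sequence above works; for comparison, here is a more direct sketch that builds the same ab_page and intercept columns on a fresh copy (df2c_alt is just an illustrative name):

df2c_alt = df2.copy()
df2c_alt['ab_page'] = (df2c_alt['group'] == 'treatment').astype(int)  # 1 = treatment/new_page, 0 = control/old_page
df2c_alt['intercept'] = 1
df2c_alt[['user_id', 'converted', 'ab_page', 'intercept']].head()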

c. Use statsmodels to import your regression model. Instantiate the model, and fit the model using the two columns you created in part b. to predict whether or not an individual converts.

In [54]:
log_mod = sm.Logit(df2c['converted'], df2c[['intercept', 'ab_page']])
results = log_mod.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.366118
         Iterations 6
Out[54]:
Logit Regression Results
Dep. Variable: converted No. Observations: 290584
Model: Logit Df Residuals: 290582
Method: MLE Df Model: 1
Date: Fri, 29 Jun 2018 Pseudo R-squ.: 8.077e-06
Time: 11:51:15 Log-Likelihood: -1.0639e+05
converged: True LL-Null: -1.0639e+05
LLR p-value: 0.1899
coef std err z P>|z| [0.025 0.975]
intercept -1.9888 0.008 -246.669 0.000 -2.005 -1.973
ab_page -0.0150 0.011 -1.311 0.190 -0.037 0.007

d. Provide the summary of your model below, and use it as necessary to answer the following questions.

In [55]:
# Need to exponentiate in order to interpret them.
np.exp(-1.9888), np.exp(-0.0150)
Out[55]:
(0.13685955841614492, 0.9851119396030626)
In [56]:
# Since they are negative, easier to explain as 1/np.exp()
1/np.exp(-1.9888), 1/np.exp(-0.0150)
Out[56]:
(7.3067603868728614, 1.015113064615719)

Each of these exponentiated values is the multiplicative change in the odds of conversion.

  • When a coefficient is negative (odds ratio below one) it is often clearer to report the reciprocal. Since ab_page is a 0/1 variable, the reciprocal says that users who saw the old page (ab_page = 0) are about 1.015 times as likely to convert as users who saw the new page, holding all else constant. Not much of an impact.
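
Rather than typing the coefficients in by hand, the same exponentiation can be pulled directly from the fitted results object; a sketch assuming results is the fitted Logit model above:

np.exp(results.params)      # odds ratios for intercept and ab_page
np.exp(results.conf_int())  # 95% confidence intervals on the odds-ratio scale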

e. What is the p-value associated with ab_page? Why does it differ from the value you found in Part II?

  • The p-value for ab_page is 0.19.

  • In Part II the p_value was about 0.90 (0.9038 to 0.9051); in Part III it is 0.19.
  • The reason is that in Part II we constructed a one-sided test: we only wanted evidence that the new_page has a higher conversion rate than the old_page. That test is not interested in the possibility that the new_page is much worse than the old_page; if it is worse, we simply keep the old_page. Our hypothesis statements determine which tail(s) of the distribution the p-value is computed from. Part II is a one-tailed test; the regression in Part III is effectively a two-tailed test.
  • You can reconcile the p-values with 1 - (0.19 / 2) = 0.905, which matches the Part II values of 0.9038 (simulation) and 0.9051 (ztest). Part II asks specifically whether the new page converts better, so it is one-tailed; the regression in Part III only asks whether ab_page has any effect at all, in either direction, so it is two-tailed.
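
A quick numeric check of that relationship, where 0.190 is the two-sided p-value for ab_page from the summary above:

two_sided_p = 0.190
one_sided_equiv = 1 - two_sided_p / 2
one_sided_equiv  # 0.905, in line with the Part II p_values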

f. Now, you are considering other things that might influence whether or not an individual converts. Discuss why it is a good idea to consider other factors to add into your regression model. Are there any disadvantages to adding additional terms into your regression model?


  • One idea alluded to within this project is using the date and time information to see whether timing correlates with conversion rates in favor of the new_page. The timing of behaviour within an A/B test is often important: some people like change almost no matter what, while others have a lot of inertia when it comes to change, even if it is positive. This data is time and date stamped, so we explore that idea below.
  • It is always good to reflect on what your data seems to be telling you. Simpson's Paradox is an excellent example: looking at the data within categories sometimes obscures the fact that, overall, the answer to the question you are asking is the opposite of what you observe in the specific categories.
  • Simpson's paradox (https://en.wikipedia.org/wiki/Simpson%27s_paradox), or the Yule–Simpson effect, is a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when the groups are combined. It is sometimes called the reversal paradox or amalgamation paradox.
  • From https://en.wikipedia.org/wiki/Multiple_comparisons_problem: in statistics, the multiple comparisons, multiplicity, or multiple testing problem occurs when one considers a set of statistical inferences simultaneously, or infers a subset of parameters selected based on the observed values. In certain fields it is known as the look-elsewhere effect. The more inferences are made, the more likely erroneous inferences become. This is a potential problem when adding terms and running additional tests.
  • There might be behaviours associated with visiting the webpage that we are not capturing. Given that converting is a big decision, people are likely to visit the conversion page several times. Perhaps the question should be: "For those who are likely to convert (e.g. defined as more than one visit to the conversion page), which page (old_page vs new_page) had more conversions?" Unfortunately, this dataset does not contain that information. In the real world, however, it almost certainly would be captured: identifying unique visitors, how many times they visited a page, what they clicked on, visit duration, and so on is standard practice for organizations gathering data on their websites.
  • One likely disadvantage of adding additional variables is multicollinearity, i.e. two or more predictors being correlated with each other. Multicollinearity makes coefficients hard to interpret and can inflate their standard errors, which creates challenges in constructing and interpreting the statistical tests.
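
One quick way to screen for the multicollinearity concern above is the variance inflation factor from statsmodels; a sketch, assuming X is a dataframe of the model's predictor columns (for example the intercept, ab_page, and country dummies built further down in this notebook):

from statsmodels.stats.outliers_influence import variance_inflation_factor

X = cdf2[['intercept', 'ab_page', 'UK', 'US']]  # assumed predictor frame; cdf2 is built later in this notebook
# VIF near 1 means a predictor is essentially uncorrelated with the others;
# values above roughly 5-10 are a common rule of thumb for problematic collinearity.
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
vifs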

g. Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. You will need to read in the countries.csv dataset and merge your datasets on the appropriate rows.

Does it appear that country had an impact on conversion? Provide the statistical output as well as a written response to answer this question.

Answer provided after results.summary() below

In [57]:
cdf = pd.read_csv('./countries.csv')
In [58]:
cdf.head(2)
Out[58]:
user_id country
0 834778 UK
1 928468 US
In [59]:
cdf2 = df2c.merge(cdf, on='user_id', how='inner')
cdf2.head(2)
Out[59]:
user_id converted ab_page intercept country
0 851104 0 0 1 US
1 804228 0 0 1 US
In [60]:
cdf2.isnull().sum()
Out[60]:
user_id      0
converted    0
ab_page      0
intercept    0
country      0
dtype: int64
In [61]:
cdf2.country.unique()
Out[61]:
array(['US', 'CA', 'UK'], dtype=object)
In [62]:
cdf2[['CA', 'UK', 'US']] = pd.get_dummies(cdf2['country'])
cdf2 = cdf2.drop(['country', 'CA'], axis=1)
In [63]:
cdf2.head()
Out[63]:
user_id converted ab_page intercept UK US
0 851104 0 0 1 0 1
1 804228 0 0 1 0 1
2 661590 0 1 1 0 1
3 853541 0 1 1 0 1
4 864975 1 0 1 0 1
In [64]:
log_mod = sm.Logit(cdf2['converted'], cdf2[['intercept', 'ab_page', 'UK', 'US']])
results = log_mod.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6
Out[64]:
Logit Regression Results
Dep. Variable: converted No. Observations: 290584
Model: Logit Df Residuals: 290580
Method: MLE Df Model: 3
Date: Fri, 29 Jun 2018 Pseudo R-squ.: 2.323e-05
Time: 11:51:18 Log-Likelihood: -1.0639e+05
converged: True LL-Null: -1.0639e+05
LLR p-value: 0.1760
coef std err z P>|z| [0.025 0.975]
intercept -2.0300 0.027 -76.249 0.000 -2.082 -1.978
ab_page -0.0149 0.011 -1.307 0.191 -0.037 0.007
UK 0.0506 0.028 1.784 0.074 -0.005 0.106
US 0.0408 0.027 1.516 0.130 -0.012 0.093
In [65]:
# Need to exponentiate in order to interpret them.
np.exp(-0.0149), np.exp(0.0506), np.exp(0.0408)
Out[65]:
(0.9852104557227469, 1.0519020483004984, 1.0416437559600236)
In [66]:
# Explain negative, easier to explain as 1/np.exp()
1/np.exp(-0.0149)
Out[66]:
1.0150115583846535

Each of these exponentiated values is the multiplicative change in the odds of conversion, relative to the baseline country (CA).

  • A user from the UK is about 1.052 times as likely to convert as a user from CA, holding all else constant.
  • A user from the US is about 1.042 times as likely to convert as a user from CA, holding all else constant.
  • When a value is less than one it is often clearer to report the reciprocal: users who saw the old page are about 1.015 times as likely to convert as users who saw the new page, holding all else constant.

None of these p_values are significant; they are all above 0.05. We fail to reject the null hypothesis: the new_page is not significantly better than the old_page.

h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model.

Provide the summary results, and your conclusions based on the results.

In [67]:
df_int = cdf2.copy()
df_int.head()
Out[67]:
user_id converted ab_page intercept UK US
0 851104 0 0 1 0 1
1 804228 0 0 1 0 1
2 661590 0 1 1 0 1
3 853541 0 1 1 0 1
4 864975 1 0 1 0 1
In [68]:
log_mod = sm.Logit(df_int['converted'], df_int[['intercept', 'ab_page', 'UK',  'US']])
results = log_mod.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6
Out[68]:
Logit Regression Results
Dep. Variable: converted No. Observations: 290584
Model: Logit Df Residuals: 290580
Method: MLE Df Model: 3
Date: Fri, 29 Jun 2018 Pseudo R-squ.: 2.323e-05
Time: 11:51:20 Log-Likelihood: -1.0639e+05
converged: True LL-Null: -1.0639e+05
LLR p-value: 0.1760
coef std err z P>|z| [0.025 0.975]
intercept -2.0300 0.027 -76.249 0.000 -2.082 -1.978
ab_page -0.0149 0.011 -1.307 0.191 -0.037 0.007
UK 0.0506 0.028 1.784 0.074 -0.005 0.106
US 0.0408 0.027 1.516 0.130 -0.012 0.093
  • None of the p_values are significant.
In [69]:
df_int['UK_ab_page'] = df_int['UK'] * df_int['ab_page']
df_int['US_ab_page'] = df_int['US'] * df_int['ab_page']
df_int = df_int.drop('intercept', axis=1)
df_int.head()
Out[69]:
user_id converted ab_page UK US UK_ab_page US_ab_page
0 851104 0 0 0 1 0 0
1 804228 0 0 0 1 0 0
2 661590 0 1 0 1 0 1
3 853541 0 1 0 1 0 1
4 864975 1 0 0 1 0 0
In [70]:
df_int.corr(method='spearman')
Out[70]:
user_id converted ab_page UK US UK_ab_page US_ab_page
user_id 1.000000 0.001769 -0.001745 -0.002059 0.001187 -0.002814 0.000217
converted 0.001769 1.000000 -0.002432 0.001772 -0.000237 0.001827 -0.002563
ab_page -0.001745 -0.002432 1.000000 -0.002092 0.001216 0.376626 0.734811
UK -0.002059 0.001772 -0.002092 1.000000 -0.881979 0.653496 -0.423595
US 0.001187 -0.000237 0.001216 -0.881979 1.000000 -0.576370 0.480278
UK_ab_page -0.002814 0.001827 0.376626 0.653496 -0.576370 1.000000 -0.276818
US_ab_page 0.000217 -0.002563 0.734811 -0.423595 0.480278 -0.276818 1.000000


This question is asking whether there is an interaction between country and page: for example, do people from the UK like the new_page while people from the US prefer the old_page? Here are my findings.

  • None of the p_values are significant.
  • In the Spearman table, the large correlations involving ab_page (e.g. 0.734811 between ab_page and US_ab_page) are expected by construction, since US_ab_page is simply the product of US and ab_page. The correlations between converted and the interaction columns are all essentially zero. Still, since most of the traffic is from the US, it is worth checking whether US visitors on their own respond better to the new_page. I will do a quick test on this below.
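
Before that quick US-only check, here is a sketch of what part h.'s interaction model fit would look like, using df_int from above with the intercept added back (results not run here):

df_int['intercept'] = 1
log_mod_int = sm.Logit(df_int['converted'],
                       df_int[['intercept', 'ab_page', 'UK', 'US', 'UK_ab_page', 'US_ab_page']])
results_int = log_mod_int.fit()
results_int.summary()
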
In [71]:
df2.head(2)
Out[71]:
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
In [72]:
df_usa = df2.merge(cdf, on='user_id', how='inner')
df_usa = df_usa.query('country == "US"')
df_usa.head(2)
Out[72]:
user_id timestamp group landing_page converted country
0 851104 2017-01-21 22:11:48.556739 control old_page 0 US
1 804228 2017-01-12 08:01:45.159739 control old_page 0 US
In [73]:
n_old = df_usa.query('group == "control"').count().tolist()
n_old = n_old[0]
n_old
Out[73]:
101716
In [74]:
convert_old = df_usa.query('group == "control" and converted == 1').count().tolist()
convert_old = convert_old[0]
convert_old
Out[74]:
12270
In [75]:
n_new = df_usa.query('group == "treatment"').count().tolist()
n_new = n_new[0]
n_new
Out[75]:
101903
In [76]:
convert_new = df_usa.query('group == "treatment" and converted == 1').count()
convert_new = convert_new[0]
convert_new
Out[76]:
12072
In [77]:
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
z_score, p_value
Out[77]:
(1.5051934564639167, 0.9338628264991603)

The ztest yields a p_value of 0.9339, which is greater than the 0.05 level of significance. We fail to reject the null hypothesis for the US subset: among US visitors, the new_page does not convert better than the old_page; the old_page is as good or better.

I am Canadian. I want to see this question explicitly answered from a Canadian point of view.

In [78]:
cdf = pd.read_csv('./countries.csv')
In [79]:
cdf.head(2)
Out[79]:
user_id country
0 834778 UK
1 928468 US
In [80]:
cdf2 = df2c.merge(cdf, on='user_id', how='inner')
cdf2.head(2)
Out[80]:
user_id converted ab_page intercept country
0 851104 0 0 1 US
1 804228 0 0 1 US
In [81]:
cdf2.isnull().sum()
Out[81]:
user_id      0
converted    0
ab_page      0
intercept    0
country      0
dtype: int64
In [82]:
cdf2.country.unique()
Out[82]:
array(['US', 'CA', 'UK'], dtype=object)
In [83]:
cdf2[['CA', 'UK', 'US']] = pd.get_dummies(cdf2['country'])
cdf2 = cdf2.drop(['country', 'US'], axis=1)
In [84]:
cdf2.head()
Out[84]:
user_id converted ab_page intercept CA UK
0 851104 0 0 1 0 0
1 804228 0 0 1 0 0
2 661590 0 1 1 0 0
3 853541 0 1 1 0 0
4 864975 1 0 1 0 0
In [85]:
log_mod = sm.Logit(cdf2['converted'], cdf2[['intercept', 'ab_page', 'CA', 'UK']])
results = log_mod.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6
Out[85]:
Logit Regression Results
Dep. Variable: converted No. Observations: 290584
Model: Logit Df Residuals: 290580
Method: MLE Df Model: 3
Date: Fri, 29 Jun 2018 Pseudo R-squ.: 2.323e-05
Time: 11:51:24 Log-Likelihood: -1.0639e+05
converged: True LL-Null: -1.0639e+05
LLR p-value: 0.1760
coef std err z P>|z| [0.025 0.975]
intercept -1.9893 0.009 -223.763 0.000 -2.007 -1.972
ab_page -0.0149 0.011 -1.307 0.191 -0.037 0.007
CA -0.0408 0.027 -1.516 0.130 -0.093 0.012
UK 0.0099 0.013 0.743 0.457 -0.016 0.036
In [86]:
# Need to exponentiate in order to interpret them.
np.exp(-0.0149), np.exp(-0.0408), np.exp(0.0099)
Out[86]:
(0.9852104557227469, 0.9600211149716509, 1.0099491671175422)
In [87]:
# Explain negative, easier to explain as 1/np.exp()
1/np.exp(-0.0149), 1/np.exp(-0.0408)
Out[87]:
(1.0150115583846535, 1.0416437559600236)

Each of these exponentiated values is the multiplicative change in the odds of conversion, relative to the baseline country (US).

  • A user from the UK is about 1.0099 times as likely to convert as a user from the US, holding all else constant.
  • When a value is less than one it is often clearer to report the reciprocal: users who saw the old page are about 1.015 times as likely to convert as users who saw the new page, holding all else constant.
  • Likewise, a user from the US is about 1.0416 times as likely to convert as a user from CA, holding all else constant.

None of these p_values are significant; they are all above 0.05. We fail to reject the null hypothesis: the new_page is not significantly better than the old_page.

h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model.

Provide the summary results, and your conclusions based on the results.

In [88]:
df_int = cdf2.copy()
df_int.head()
Out[88]:
user_id converted ab_page intercept CA UK
0 851104 0 0 1 0 0
1 804228 0 0 1 0 0
2 661590 0 1 1 0 0
3 853541 0 1 1 0 0
4 864975 1 0 1 0 0
In [89]:
log_mod = sm.Logit(df_int['converted'], df_int[['intercept', 'ab_page', 'CA', 'UK']])
results = log_mod.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6
Out[89]:
Logit Regression Results
Dep. Variable: converted No. Observations: 290584
Model: Logit Df Residuals: 290580
Method: MLE Df Model: 3
Date: Fri, 29 Jun 2018 Pseudo R-squ.: 2.323e-05
Time: 11:51:26 Log-Likelihood: -1.0639e+05
converged: True LL-Null: -1.0639e+05
LLR p-value: 0.1760
coef std err z P>|z| [0.025 0.975]
intercept -1.9893 0.009 -223.763 0.000 -2.007 -1.972
ab_page -0.0149 0.011 -1.307 0.191 -0.037 0.007
CA -0.0408 0.027 -1.516 0.130 -0.093 0.012
UK 0.0099 0.013 0.743 0.457 -0.016 0.036
  • None of the p_values are significant.
In [90]:
df_int['UK_ab_page'] = df_int['UK'] * df_int['ab_page']
df_int['CA_ab_page'] = df_int['CA'] * df_int['ab_page']
df_int = df_int.drop('intercept', axis=1)
df_int.head()
Out[90]:
user_id converted ab_page CA UK UK_ab_page CA_ab_page
0 851104 0 0 0 0 0 0
1 804228 0 0 0 0 0 0
2 661590 0 1 0 0 0 0
3 853541 0 1 0 0 0 0
4 864975 1 0 0 0 0 0
In [91]:
df_int.corr(method='spearman')
Out[91]:
user_id converted ab_page CA UK UK_ab_page CA_ab_page
user_id 1.000000 0.001769 -0.001745 0.001596 -0.002059 -0.002814 -0.000303
converted 0.001769 1.000000 -0.002432 -0.003022 0.001772 0.001827 -0.003807
ab_page -0.001745 -0.002432 1.000000 0.001600 -0.002092 0.376626 0.160519
CA 0.001596 -0.003022 0.001600 1.000000 -0.132090 -0.086320 0.700541
UK -0.002059 0.001772 -0.002092 -0.132090 1.000000 0.653496 -0.092534
UK_ab_page -0.002814 0.001827 0.376626 -0.086320 0.653496 1.000000 -0.060471
CA_ab_page -0.000303 -0.003807 0.160519 0.700541 -0.092534 -0.060471 1.000000


This question is asking whether there is an interaction between country and page, this time from the Canadian point of view: do people from Canada respond differently to the new_page? Here are my findings.

  • None of the p_values are significant.
  • As before, the larger correlations involving ab_page (e.g. 0.700541 between CA and CA_ab_page) are expected by construction. The correlation between converted and CA_ab_page is essentially zero (-0.003807), so there is no sign that Canadians respond differently to the new_page.

Are Date and Time Correlated With Conversion Rates?

The process is:

  • Find the range of days
  • If manageable, bar chart each day for successful conversions.
  • Hopefully it looks significant.
  • Conduct z_tests in a for loop that yields the p_value for each day.
  • Sort the p_values
  • See if there are any significant p_values and whether any patterns surround them, for example a steady building or deterioration of conversions.
In [92]:
df2.head(2)
Out[92]:
user_id timestamp group landing_page converted
0 851104 2017-01-21 22:11:48.556739 control old_page 0
1 804228 2017-01-12 08:01:45.159739 control old_page 0
In [93]:
df2s = df2.sort_values('timestamp')
df2s.iloc[0], df2s.iloc[-1]
Out[93]:
(user_id                             922696
 timestamp       2017-01-02 13:42:05.378582
 group                            treatment
 landing_page                      new_page
 converted                                0
 Name: 131228, dtype: object, user_id                             920411
 timestamp       2017-01-24 13:41:54.460509
 group                              control
 landing_page                      old_page
 converted                                0
 Name: 193652, dtype: object)
In [94]:
df2s['day'] = df2s['timestamp'].apply(lambda x: x[8:10])
df2s.head()
Out[94]:
user_id timestamp group landing_page converted day
131228 922696 2017-01-02 13:42:05.378582 treatment new_page 0 02
184884 781507 2017-01-02 13:42:15.234051 control old_page 0 02
83878 737319 2017-01-02 13:42:21.786186 control old_page 0 02
102717 818377 2017-01-02 13:42:26.640581 treatment new_page 0 02
158789 725857 2017-01-02 13:42:27.851110 treatment new_page 0 02
In [95]:
df2g = df2s.groupby(['day', 'group']).count().reset_index()
df2g = df2g.drop(['user_id', 'timestamp', 'landing_page'], axis=1)
df2g.head()
Out[95]:
day group converted
0 02 control 2859
1 02 treatment 2853
2 03 control 6590
3 03 treatment 6618
4 04 control 6578
In [96]:
locations = df2g.day.unique()
heights_control = df2g.query('group == "control"')['converted'].tolist()
heights_treatment = df2g.query('group == "treatment"')['converted'].tolist()
labels = range(2, 25)
plt.bar(locations, height=heights_control, tick_label=labels, color = 'red', alpha=.25)
plt.bar(locations, height=heights_treatment, tick_label=labels, color='blue', alpha=.25);

Not much to see here.

In [97]:
zp = []
for value in locations:
    convert_old = df2s.query('group == "control" and converted == 1 and day == @value').shape[0]
    convert_new = df2s.query('group == "treatment" and converted == 1 and day == @value').shape[0]
    n_old = df2s.query('group == "control" and day == @value').shape[0]
    n_new = df2s.query('group == "treatment" and day == @value').shape[0]
    z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
    zp.append((value, z_score, p_value))
In [98]:
pd.DataFrame(zp, columns = ['day', 'zscore', 'pvalue']).sort_values('pvalue')
Out[98]:
day zscore pvalue
8 10 -2.399614 0.008206
4 06 -1.445197 0.074201
15 17 -0.761222 0.223262
22 24 -0.490961 0.311727
18 20 -0.436937 0.331078
6 08 -0.331142 0.370268
10 12 -0.051832 0.479331
16 18 0.002664 0.501063
1 03 0.005103 0.502036
20 22 0.205629 0.581460
7 09 0.280733 0.610542
14 16 0.467850 0.680054
17 19 0.483645 0.685681
0 02 0.655830 0.744033
9 11 0.679542 0.751603
21 23 0.809533 0.790896
5 07 0.850097 0.802364
2 04 0.931604 0.824229
11 13 1.017846 0.845625
13 15 1.261408 0.896419
12 14 1.308438 0.904638
3 05 1.446820 0.926026
19 21 1.812241 0.965026

For one brilliant day, January 10, 2017, the new_page had a significant p_value of 0.008, which is less than 0.05. However, no other day exhibited this behaviour, and there is no pattern around that day, e.g. conversions building up to it or subsiding from it. There are also no weekend or weekday patterns. Given that 23 daily tests were run, one p_value below 0.05 is also exactly what the multiple comparisons problem mentioned earlier warns about. We fail to reject the null hypothesis: there are no significant time-related changes in conversion rates in this dataset.
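
A simple Bonferroni-style check confirms that the January 10 result does not survive a multiple-comparisons correction; a sketch using the zp list built in the loop above:

alpha = 0.05
n_tests = len(zp)                   # 23 daily tests
bonferroni_alpha = alpha / n_tests  # ≈ 0.0022
min_p = min(p for _, _, p in zp)    # ≈ 0.0082 (January 10)
min_p < bonferroni_alpha            # False: not significant after correction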

Conclusions

Based on simulations, logistic regression models (Logit), and z_tests, we do not see any significant results. The p_values in every case except one (January 10, 2017) were not significant, and that single day does not survive a multiple-comparisons correction. We fail to reject the null hypothesis: the old_page is just as good as, if not better than, the new_page.

In [99]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Analyze_ab_test_results_notebook.ipynb'])
Out[99]:
4294967295