The p-values in the text may be slightly inconsistent because many simulation runs were done after the text was written. However, rerunning the simulations produced no significant changes in the p-values.
Import required libraries.
import pandas as pd
import numpy as np
import random
from scipy import stats
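# Newer SciPy versions removed stats.chisqprob, which older statsmodels summary code still calls;
# the next line restores it via the equivalent chi-squared survival function.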
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
random.seed(42)
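# Note: random.seed only seeds Python's built-in random module. The simulations below use
# np.random.binomial, so np.random.seed(42) would also be needed for exact reproducibility.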
1. Read in the dataset.
a. Take a look at the top few rows.
df = pd.read_csv('./ab_data.csv')
df.head()
b. Use the cell below to find the number of rows in the dataset.
nof_rows = df.shape[0]
nof_rows
c. The number of unique users in the dataset.
df.user_id.nunique()
d. The proportion of users converted.
df.converted.mean()
e. The number of times the new_page and treatment don't line up.
df.groupby(['group', 'landing_page']).count()
# Rows where group and landing_page don't line up (treatment/old_page or control/new_page)
dont_line_up = ((df['group'] == 'treatment') != (df['landing_page'] == 'new_page')).sum()
dont_line_up
f. Do any of the rows have missing values?
df.isnull().sum()
For the rows where treatment does not line up with new_page, or control does not line up with old_page, we cannot be sure whether those users actually received the new or old page. Drop those rows and store the resulting dataframe in df2.
df2 = df[((df['group'] == 'control') & (df['landing_page'] == 'old_page')) |
         ((df['group'] == 'treatment') & (df['landing_page'] == 'new_page'))]
df2.head(10)
# Double Check all of the incorrect rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]
df2.shape
How many unique user_ids are in df2?
df2['user_id'].nunique()
There is one user_id repeated in df2.
df2[df2['user_id'].duplicated()].count()
doubles = df2[df2.user_id.duplicated(keep=False)]  # keep=False marks both rows of the repeated user_id
What is the row information for the repeat user_id?
doubles
df2.shape
d. Remove one of the rows with a duplicate user_id, but keep your dataframe as df2.
df2 = df2.drop_duplicates(subset='user_id', keep='first')
df2[df2['user_id'].duplicated()].count()
What is the probability of an individual converting regardless of the page they receive?
df2['converted'].mean()
df2.head()
Given that an individual was in the control group, what is the probability they converted?
ccr = df2.query('group == "control"')['converted'].mean()
ccr
Given that an individual was in the treatment group, what is the probability they converted?
tcr = df2.query('group == "treatment"')['converted'].mean()
tcr
obs_diff = tcr - ccr
obs_diff
What is the probability that an individual received the new page?
df2_groupby = df2.groupby('landing_page').count()
df2_groupby
# groupby sorts landing_page alphabetically, so iloc[0] is new_page and iloc[1] is old_page
new_page_prob = df2_groupby.user_id.iloc[0] / (df2_groupby.user_id.iloc[0] + df2_groupby.user_id.iloc[1])
new_page_prob
e. Use the results in the previous two portions of this question to suggest if you think there is evidence that one page leads to more conversions? Write your response below.
The conversion rates are very close. The new page may or may not result in a different conversion rate; further study is needed. However, the results so far do not look promising for the new page.
Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each observation was observed.
However, then the hard question is do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time? How long do you run to render a decision that neither page is better than another?
These questions are the difficult parts associated with A/B tests in general.
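As an aside (not required by the project), here is a minimal sketch of such sequential monitoring, assuming df2 from above with its timestamp, group, and converted columns; the variable names are illustrative. It recomputes the one-sided z-test after each additional day of data:
import statsmodels.api as sm
df_seq = df2.sort_values('timestamp').copy()
df_seq['date'] = df_seq['timestamp'].str[:10]  # timestamps are strings like '2017-01-14 ...'
running = []
for cutoff in sorted(df_seq['date'].unique()):
    so_far = df_seq[df_seq['date'] <= cutoff]
    c_old = so_far.query('group == "control" and converted == 1').shape[0]
    c_new = so_far.query('group == "treatment" and converted == 1').shape[0]
    nobs_old = so_far.query('group == "control"').shape[0]
    nobs_new = so_far.query('group == "treatment"').shape[0]
    z, p = sm.stats.proportions_ztest([c_old, c_new], [nobs_old, nobs_new], alternative='smaller')
    running.append((cutoff, z, p))
pd.DataFrame(running, columns=['through_date', 'z_score', 'p_value'])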
1. For now, consider you need to make the decision just based on all the data provided. If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be? You can state your hypotheses in terms of words or in terms of $p_{old}$ and $p_{new}$, which are the converted rates for the old and new pages.
$$H_0: p_{new} - p_{old} \leq 0$$
$$H_1: p_{new} - p_{old} > 0$$
2. Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the converted success rate regardless of page - that is, $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the converted rate in ab_data.csv regardless of the page.
Use a sample size for each page equal to the ones in ab_data.csv.
Build the sampling distribution for the difference in converted rates between the two pages over 10,000 iterations, calculating an estimate from the null each time.
Use the cells below to provide the necessary parts of this simulation. If this doesn't make complete sense right now, don't worry - you are going to work through the problems below to complete this problem. You can use Quiz 5 in the classroom to make sure you are on the right track.
# Exercise states "Assume under the null hypothesis, p_new and p_old both have
# "true" success rates equal to the converted success rate regardless of page -
# that is p_new and p_old are equal.
proportion = df.converted.mean()
proportion
df2.head()
treatment_df = df2.query('group == "treatment"')
treatment_cr = treatment_df.converted.mean()
treatment_cr
control_df = df2.query('group == "control"')
control_cr = control_df.converted.mean()
control_cr
difference = treatment_cr - control_cr
difference
a. What is the convert rate for $p_{new}$ under the null?
p_new = df2['converted'].mean() # Same as Part 1 4a
p_new
b. What is the convert rate for $p_{old}$ under the null?
p_old = df2['converted'].mean() # Same as Part 1 4a
p_old
c. What is $n_{new}$?
n_new = treatment_df.shape[0]
n_new
d. What is $n_{old}$?
n_old = control_df.shape[0]
n_old
e. Simulate $n_{new}$ transactions with a convert rate of $p_{new}$ under the null. Store these $n_{new}$ 1's and 0's in new_page_converted.
new_page_converted = np.random.binomial(1, p_new, n_new)
tsm = new_page_converted.mean()
tsm
f. Simulate $n_{old}$ transactions with a convert rate of $p_{old}$ under the null. Store these $n_{old}$ 1's and 0's in old_page_converted.
old_page_converted = np.random.binomial(1, p_old, n_old)
csm = old_page_converted.mean()
csm
g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f).
# compute simulated difference in conversion rate
sim_diff = tsm - csm
# display simulation difference
sim_diff
h. Simulate 10,000 $p_{new}$ - $p_{old}$ values using this same process similarly to the one you calculated in parts a. through g. above. Store all 10,000 values in p_diffs.
p_diffs = []
for _ in range(10000):
    old_page_converted = np.random.binomial(1, p_old, n_old)
    csm_b = old_page_converted.mean()
    new_page_converted = np.random.binomial(1, p_new, n_new)
    tsm_b = new_page_converted.mean()
    p_diffs.append(tsm_b - csm_b)
# A vectorized numpy alternative that simulates the same sampling distribution without a for loop:
# new_converted_simulation = np.random.binomial(n_new, p_new, 10000)/n_new
# old_converted_simulation = np.random.binomial(n_old, p_old, 10000)/n_old
# p_diffs = new_converted_simulation - old_converted_simulation
df.head()
p_diffs = np.array(p_diffs)
pdm = p_diffs.mean()
pdm
p_diffs.std()
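As a quick sanity check (a sketch, not part of the exercise; analytic_se is an illustrative name), the simulated standard deviation should be close to the analytic standard error of the difference of two independent proportions under the null:
analytic_se = np.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
analytic_se  # should be close to p_diffs.std() above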
i. Plot a histogram of the p_diffs. Does this plot look like what you expected? Use the matching problem in the classroom to assure you fully understand what was computed here.
plt.hist(p_diffs);
This is what I expected: a very symmetrical, normal-looking distribution.
j. What proportion of the p_diffs are greater than the actual difference observed in ab_data.csv?
# Same as Part I 4c minus Part I 4b.
ccr = df2.query('group == "control"')['converted'].mean()
tcr = df2.query('group == "treatment"')['converted'].mean()
obs_diff = tcr - ccr
obs_diff
# compute p value
pvalue = (p_diffs > obs_diff).mean()
pvalue
# plot line for observed statistic
plt.hist(p_diffs, alpha=.5)
plt.axvline(x=obs_diff, color='red');
k. In words, explain what you just computed in part j. What is this value called in scientific studies? What does this value mean in terms of whether or not there is a difference between the new and old pages?
The value computed in part j. is the p-value: the probability of observing a difference in conversion rates at least as extreme as the one in ab_data.csv if the null hypothesis were true. Because this p-value is well above the 0.05 Type I error rate, we fail to reject the null hypothesis; there is no evidence that the new page converts better than the old page.
l. We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. Fill in the below to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let n_old and n_new refer to the number of rows associated with the old and new pages, respectively.
df2.head()
import statsmodels.api as sm
convert_old = df2.query('group == "control" and converted == 1').shape[0]
convert_new = df2.query('group == "treatment" and converted == 1').shape[0]
n_old = control_df.shape[0]
n_new = treatment_df.shape[0]
print(convert_old, convert_new, n_old, n_new)
m. Now use stats.proportions_ztest to compute your test statistic and p-value. Here is a helpful link on using the built-in.
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
z_score, p_value
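As a cross-check (an aside, not part of the original walkthrough): with the counts passed in the order [old, new] and alternative='smaller', the alternative being tested is that the old page's conversion rate is smaller than the new page's, and the one-sided p-value is the normal CDF of the z statistic:
stats.norm.cdf(z_score)  # should match p_value above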
n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages? Do they agree with the findings in parts j. and k.?
The z-score measures how many standard errors the observed difference in conversion rates is from zero under the null. The one-sided p-value is well above the 0.05 significance level, so, consistent with parts j. and k., we fail to reject the null hypothesis that the old page converts at least as well as the new page.
1. In this final part, you will see that the result you achieved in the previous A/B test can also be achieved by performing regression.
a. Since each row is either a conversion or no conversion, what type of regression should you be performing in this case?
We would use a logistic regression model. The reason we would use this rather than ordinary least squares (OLS) regression is that the dependent variable is binary (converted or not converted) rather than continuous. We will use the statsmodels Logit method.
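For reference, the model being fit below can be written as a standard logistic regression of the log-odds of conversion on the page indicator:
$$\log\left(\frac{p(\text{converted})}{1 - p(\text{converted})}\right) = \beta_0 + \beta_1 \cdot ab\_page$$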
b. The goal is to use statsmodels to fit the regression model you specified in part a. to see if there is a significant difference in conversion based on which page a customer receives. However, you first need to create a column for the intercept, and create a dummy variable column for which page each user received. Add an intercept column, as well as an ab_page column, which is 1 when an individual receives the treatment and 0 if control.
# Painful to rerun the simulation if df2 gets corrupted, so work on a copy.
# This also avoids hidden in-place modification issues.
df2c = df2.copy()
df2c.head(2)
# ab_page is 1 when the individual received the treatment (new_page) and 0 for control.
df2c['ab_page'] = (df2c['group'] == 'treatment').astype(int)
df2c = df2c.drop(['timestamp', 'group', 'landing_page'], axis=1)
df2c.head()
df2c['intercept'] = 1
c. Use statsmodels to import your regression model. Instantiate the model, and fit the model using the two columns you created in part b. to predict whether or not an individual converts.
log_mod = sm.Logit(df2c['converted'], df2c[['intercept', 'ab_page']])
results = log_mod.fit()
results.summary()
d. Provide the summary of your model below, and use it as necessary to answer the following questions.
# Need to exponentiate in order to interpret them.
np.exp(-1.9888), np.exp(-0.0150)
# Since they are negative, easier to explain as 1/np.exp()
1/np.exp(-1.9888), 1/np.exp(-0.0150)
Each of these exponentiated values is the multiplicative change in the odds of conversion for a one-unit increase in the corresponding variable, holding the other variables constant.
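In equation form (added for clarity), exponentiating the ab_page coefficient gives the odds ratio of conversion for the new page relative to the old page:
$$e^{\beta_1} = \frac{\text{odds}(\text{converted} \mid ab\_page = 1)}{\text{odds}(\text{converted} \mid ab\_page = 0)}$$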
e. What is the p-value associated with ab_page?
It is reported in the regression summary above. As with the tests in Part II, it is above .05, so ab_page is not a significant predictor of conversion. Note that the regression reports a two-tailed p-value, whereas the test in Part II was one-tailed, which is why the values differ.
f. Now, you are considering other things that might influence whether or not an individual converts. Discuss why it is a good idea to consider other factors to add into your regression model. Are there any disadvantages to adding additional terms into your regression model?
Other factors, such as the country a user lives in or the day on which they visited (both explored below), could also influence conversion, and including them may reveal effects the page variable alone does not capture. The disadvantages are added model complexity, possible multicollinearity among predictors, and coefficients that become harder to interpret.
g. Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives in. You will need to read in the countries.csv dataset and merge together your datasets on the appropriate rows. Here are the docs for joining tables.
Does it appear that country had an impact on conversion? Provide the statistical output as well as a written response to answer this question.
Answer provided after results.summary() below
cdf = pd.read_csv('./countries.csv')
cdf.head(2)
cdf2 = df2c.merge(cdf, on='user_id', how='inner')
cdf2.head(2)
cdf2.isnull().sum()
cdf2.country.unique()
cdf2[['CA', 'UK', 'US']] = pd.get_dummies(cdf2['country'])
cdf2 = cdf2.drop(['country', 'CA'], axis=1)
cdf2.head()
log_mod = sm.Logit(cdf2['converted'], cdf2[['intercept', 'ab_page', 'UK', 'US']])
results = log_mod.fit()
results.summary()
# Need to exponentiate in order to interpret them.
np.exp(-0.0149), np.exp(0.0506), np.exp(0.0408)
# Explain negative, easier to explain as 1/np.exp()
1/np.exp(-0.0149)
As before, each exponentiated value is the multiplicative change in the odds of conversion for a one-unit increase in the corresponding variable.
None of these p-values is significant; they are all above .05. We fail to reject the null hypothesis: the new_page is not significantly better than the old_page.
h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model. Provide the summary results, and your conclusions based on the results.
df_int = cdf2.copy()
df_int.head()
# Create the interaction terms first, then include them in the model.
df_int['UK_ab_page'] = df_int['UK'] * df_int['ab_page']
df_int['US_ab_page'] = df_int['US'] * df_int['ab_page']
log_mod = sm.Logit(df_int['converted'], df_int[['intercept', 'ab_page', 'UK', 'US', 'UK_ab_page', 'US_ab_page']])
results = log_mod.fit()
results.summary()
df_int = df_int.drop('intercept', axis=1)
df_int.head()
df_int.corr(method='spearman')
This question is asking whether there is an interaction between country and page; for example, do people from the UK prefer the new_page while people from the US prefer the old_page? Here are my findings.
df2.head(2)
df_usa = df2.merge(cdf, on='user_id', how='inner')
df_usa = df_usa.query('country == "US"')
df_usa.head(2)
n_old = df_usa.query('group == "control"').shape[0]
n_old
convert_old = df_usa.query('group == "control" and converted == 1').shape[0]
convert_old
n_new = df_usa.query('group == "treatment"').shape[0]
n_new
convert_new = df_usa.query('group == "treatment" and converted == 1').shape[0]
convert_new
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
z_score, p_value
The z-test yields a p-value of .9339, which is greater than the .05 level of significance, so we fail to reject the null hypothesis for the US subset: among US users, the new_page does not convert significantly better than the old_page.
I am Canadian. I want to see this question explicitly answered from a Canadian point of view.
cdf = pd.read_csv('./countries.csv')
cdf.head(2)
cdf2 = df2c.merge(cdf, on='user_id', how='inner')
cdf2.head(2)
cdf2.isnull().sum()
cdf2.country.unique()
cdf2[['CA', 'UK', 'US']] = pd.get_dummies(cdf2['country'])
cdf2 = cdf2.drop(['country', 'US'], axis=1)
cdf2.head()
log_mod = sm.Logit(cdf2['converted'], cdf2[['intercept', 'ab_page', 'CA', 'UK']])
results = log_mod.fit()
results.summary()
# Need to exponentiate in order to interpret them.
np.exp(-0.0149), np.exp(-0.0408), np.exp(0.0099)
# Explain negative, easier to explain as 1/np.exp()
1/np.exp(-0.0149), 1/np.exp(-0.0408)
As before, each exponentiated value is the multiplicative change in the odds of conversion for a one-unit increase in the corresponding variable.
None of these p-values is significant; they are all above .05. We fail to reject the null hypothesis: the new_page is not significantly better than the old_page.
h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model. Provide the summary results, and your conclusions based on the results.
df_int = cdf2.copy()
df_int.head()
# Create the interaction terms first, then include them in the model.
df_int['CA_ab_page'] = df_int['CA'] * df_int['ab_page']
df_int['UK_ab_page'] = df_int['UK'] * df_int['ab_page']
log_mod = sm.Logit(df_int['converted'], df_int[['intercept', 'ab_page', 'CA', 'UK', 'CA_ab_page', 'UK_ab_page']])
results = log_mod.fit()
results.summary()
df_int = df_int.drop('intercept', axis=1)
df_int.head()
df_int.corr(method='spearman')
This question is asking whether there is an interaction between country and page; for example, do people from the UK prefer the new_page while people from the US prefer the old_page? As a further exploration, I also checked whether conversion rates changed over time, day by day. The process is:
df2.head(2)
df2s = df2.sort_values('timestamp')
df2s.iloc[0], df2s.iloc[-1]
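# Timestamps are strings like 'YYYY-MM-DD hh:mm:ss...', so characters 8:10 give the day of the month.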
df2s['day'] = df2s['timestamp'].apply(lambda x: x[8:10])
df2s.head()
df2g = df2s.groupby(['day', 'group']).count().reset_index()
df2g = df2g.drop(['user_id', 'timestamp', 'landing_page'], axis=1)
df2g.head()
locations = df2g.day.unique()
heights_control = df2g.query('group == "control"')['converted'].tolist()
heights_treatment = df2g.query('group == "treatment"')['converted'].tolist()
labels = range(2, 25)
plt.bar(locations, height=heights_control, tick_label=labels, color = 'red', alpha=.25)
plt.bar(locations, height=heights_treatment, tick_label=labels, color='blue', alpha=.25);
Not much to see here.
zp = []
for value in locations:
    convert_old = df2s.query('group == "control" and converted == 1 and day == @value').shape[0]
    convert_new = df2s.query('group == "treatment" and converted == 1 and day == @value').shape[0]
    n_old = df2s.query('group == "control" and day == @value').shape[0]
    n_new = df2s.query('group == "treatment" and day == @value').shape[0]
    z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
    zp.append((value, z_score, p_value))
pd.DataFrame(zp, columns = ['day', 'zscore', 'pvalue']).sort_values('pvalue')
For one brilliant day, Jan 10, 2017, the new_page had a significant p-value of .008, which is below .05. However, no other day exhibited this behaviour, no pattern builds up to or subsides from that day, and no weekday or weekend pattern emerged. We fail to reject the null hypothesis: there are no significant time-related changes in conversion rates in this dataset.
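One further sanity check (a sketch, not part of the original analysis; bonferroni_alpha is an illustrative name): with 23 separate daily tests at a .05 level, roughly one false positive is expected by chance alone, and a Bonferroni-adjusted threshold makes the Jan 10 result unremarkable:
bonferroni_alpha = 0.05 / len(locations)  # locations holds the 23 days
bonferroni_alpha  # the Jan 10 p-value of .008 is above this adjusted threshold, so not significant after correction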
Based on simulations, logistic regression models (Logit), and z-tests, we do not see any significant results. The p-values in all cases but one (January 10, 2017) were not significant. We fail to reject the null hypothesis: the old_page is just as good as, if not better than, the new_page.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Analyze_ab_test_results_notebook.ipynb'])