I am using the European soccer database available from Kaggle. The questions that I am asking are as follows.
- What teams improved the most?
- What team attributes lead to the highest rank?
- Who are the best home teams in terms of % wins at home?
- Jupyter Notebooks 5.4.1
- Python 3.6.4
- Anaconda, Inc. 2.6.0
- IPython 6.2.1
- DB Browser for SQLite Version 3.10.1, Qt Version 5.7.1, SQLCipher Version 3.15.2
- Data available at https://www.kaggle.com/hugomathien/soccer/data
The Match data ranges from 2008 to 2016, while the Team Attribute data ONLY goes from 2011 to 2015. To come up with a rank I need to compute it season by season. As a result, I will only be reviewing 4 seasons' worth of data, as follows.
There are potentially 21 independent variables to be ranked against the one dependent variable which is rank improvement. The methodology is to:
- Compare each Team_Attribute against the ranks of the teams in THAT SEASON. Teams change radically year to year. It makes no sense to take a mean over a number of years and compare it to the final rank.
- I will use scatter plots to initially see if there are any correlations. If there are strong correlations for several features I will choose that subset of features to look at bar charts of those features against quartile rank. If there are few correlations revealed in the scatter plots I will run all of the features with bar charts.
- If the bar charts are sufficiently revealing I will choose 4 features (attributes) and run correlations between those pairs of attributes to see if they are related.
- If those bar charts are not helpful, I will run bar charts that are based on the difference between the max and min for each quartile and create a bar chart that shows which quartile owns what percentage of that difference.
- Finally, to ensure that I have revealed statistically valid relationships I will run Spearman and Pearson algorithms against the 4 seasons to see if they agree with my results.
- I will need automation to do the above. There are 20 features with 4 seasons each. There are 4 different sets of charts (scatter, bar, bar as a share of difference, and covariance relationship between features). Altogether there are roughly 140 charts that you can generate with this notebook.
- I made extensive use of stackoverflow (https://stackoverflow.com). However, there is no carte blanche copying of code in this notebook.
- For the Pearson and Spearman statistical analysis I used this source. https://www.datascience.com/learn-data-science/fundamentals/introduction-to-correlation-python-data-science Pearson measures the linear relationship between continuous variables. Spearman's rank correlation coefficient can be defined as a special case of Pearson ρ applied to ranked (sorted) variables. According to Introduction to Correlation by Ruslana Dalinina 01.31.17 'Unlike Pearson, Spearman's correlation is not restricted to linear relationships. It measures monotonic association (only strictly increasing or decreasing, but not mixed) between two variables and relies on the rank order of values. In other words, rather than comparing means and variances, Spearman's coefficient looks at the relative order of values for each variable. This makes it appropriate to use with both continuous and discrete data.'
- I used code from this URL on Kaggle https://www.kaggle.com/dimarudov/data-analysis-using-sql to help me understand how to access the SQLite database.
- I recommend that you do not enter the notebook and use "Run All" cells. There are a large number of plots to be run. This will crush the kernel and cause subsequent cells to not run. It is best to step through the notebook and run each cell in order. Otherwise some cells will fail and it will be confusing.
- While Jupyter Notebooks are an incredibly productive system for iterating on code, they tend to encourage less-than-great programming habits, specifically namespace issues. If you find that some of the code is failing, in most cases you will only need to go to the top of the section that you are in and simply run that section again. If that does not work, look at the preceding cells until you find where matches_df or ta_df are referenced, then run the notebook from there.
- I have also provided a large number of plots for the reader to review. I have provided a simple range function at the front of most of these plots. Just edit the numbers to see the plots. Usually they will be in groups of 4. So, range(4, 8) will get you the second feature and the years from 2011-2015.
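As a quick, self-contained illustration of the Pearson/Spearman distinction quoted above (this sketch uses scipy.stats, which the notebook does not otherwise import):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative data: y is a strictly increasing (monotonic) but
# non-linear function of x. Spearman's rho only uses rank order,
# so it is exactly 1.0 here; Pearson's r, which measures the
# strength of a *linear* relationship, falls short of 1.
x = np.arange(1, 21)
y = x ** 3
r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
print(round(rho, 3), round(r, 3))  # rho is exactly 1.0
```

This is why the notebook runs both: agreement between the two gives more confidence that a relationship is real, not just an artifact of one measure's assumptions.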
import numpy as np
import pandas as pd
import sqlite3
import collections
import matplotlib.pyplot as plt
%matplotlib qt
import seaborn as sns
sns.set_style('darkgrid')
- I did the usual process of reviewing the data, looking for NaNs, deduping, running histograms, etc. Since I invested so much time automating plot creation I became quite familiar with the data. It is a very clean dataset.
- There was a huge amount of work that went into understanding this data. In order to make this notebook understandable much of that code is not provided here. I can provide the earlier iterations of the notebook for you to peruse if you wish.
I am only looking at the Match and Team_Attributes tables. The database does not include who is playing for what team, so the Player_Attributes table is impossible to link to team outcomes. The Match table has data from July 18, 2008 through May 25, 2016. The Team_Attributes data does not start until February 22, 2010 and only goes until September 10, 2015. If I want full seasons of matching data, that restricts me to July 1, 2011 through June 30, 2015. This is only 4 years. That cuts the data roughly in half, from ~25,000 rows to ~14,000 rows for Match (for example).
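The restriction described above amounts to a simple date-window filter. A minimal sketch on illustrative rows (the real notebook works from the Match table's date column):

```python
import pandas as pd

# Demo rows spanning the full Match date range; only dates inside
# the four full overlapping seasons (July 1, 2011 through
# June 30, 2015) survive the filter.
matches = pd.DataFrame({
    'date': ['2008-07-18 00:00:00', '2012-03-10 00:00:00',
             '2014-11-02 00:00:00', '2016-05-25 00:00:00'],
    'home_team_goal': [1, 2, 0, 3],
})
dates = pd.to_datetime(matches['date'])
in_window = (dates >= '2011-07-01') & (dates < '2015-07-01')
overlap = matches[in_window]
print(len(overlap))  # 2 of the 4 sample rows fall inside the window
```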
# Connect to the database
database = 'database.sqlite'
conn = sqlite3.connect(database)
tables = pd.read_sql("""SELECT *
FROM sqlite_master
WHERE type='table';""", conn)
tables
# Base table for understanding the answer to Q1: how have the teams progressed/regressed over the years?
matches_df = pd.read_sql("""SELECT Match.id,
Country.name AS country_name,
League.name AS league_name,
season,
stage,
date,
HT.team_long_name AS home_team,
AT.team_long_name AS away_team,
home_team_goal,
away_team_goal
FROM Match
JOIN Country on Country.id = Match.country_id
JOIN League on League.id = Match.league_id
LEFT JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
LEFT JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
ORDER by date;
""", conn)
matches_df.head()
I went through the typical process of looking for nulls (NaNs), duplicates, values out of range, counting unique elements and looking at them, etc. The data was quite clean. I only dropped one column in the Team_Attributes table due to NaNs. The biggest challenge is that there are 21 features in this dataset and it is quite time consuming to understand which features are important and which are not. I built automated processes to do that.
matches_df.shape
matches_df.describe()
# Histograms for home_team_goal and away_team_goal
plt.figure(figsize=(8, 5))
plt.xlabel('Goals')
plt.ylabel('Matches')
plt.hist(matches_df['home_team_goal'], alpha=.5, label='home_team_goal')
plt.hist(matches_df['away_team_goal'], alpha=.5, label='away_team_goal')
plt.title('Histogram of Match Table Numeric Features')
plt.legend();
As expected home teams score more goals than away teams!
matches_df.isnull().sum()
The above is great! Not a single null (NaN)!
# Check for duplicates
matches_df.duplicated().sum()
# Check for bad values on season. All are good.
matches_df['season'].unique()
matches_df['stage'].unique()
The Match table columns that I am using are very clean. No nulls (NaNs), duplicates, or funny values. The only issue is the lack of matching data in the Team_Attributes table.
ta_df = pd.read_sql("""SELECT Team.team_long_name, Team_Attributes.*
FROM Team_Attributes
JOIN Team
ON Team.team_api_id = Team_Attributes.team_api_id
ORDER BY Team.team_long_name
;""", conn)
ta_df.head()
ta_df.describe()
# no text fields, numeric only
features = ['buildUpPlaySpeed',
'buildUpPlayPassing',
'chanceCreationPassing',
'chanceCreationCrossing',
'chanceCreationShooting',
'defencePressure',
'defenceAggression',
'defenceTeamWidth']
# Histograms for Team_Attributes
for feature in features:
    plt.figure(figsize=(8, 5))
    plt.xlabel('Value')
    plt.ylabel('Counts')
    plt.hist(ta_df[feature], alpha=.5, label=feature)
    plt.title('Histogram of ' + feature + ' From Team Attributes Table')
    plt.legend();
There are very large concentrations of data points in the centres of most of the ta_df histograms. These histograms are directly derived from the Team Attributes table.
ta_df.isnull().sum()
With buildUpPlayDribbling having 969 NaNs in a total of 1458 rows, it makes no sense to keep this column. This attribute is still captured by buildUpPlayDribblingClass.
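The same decision can be phrased as a rule of thumb: drop any column with fewer than half its rows populated. A small sketch on demo data (the column names here are illustrative):

```python
import numpy as np
import pandas as pd

# 'mostly_nan' mimics buildUpPlayDribbling, where NaNs outnumber
# real values; dropna(axis=1, thresh=...) keeps only columns with
# at least thresh non-NaN entries.
demo = pd.DataFrame({'keep': [1, 2, 3, 4],
                     'mostly_nan': [np.nan, np.nan, np.nan, 7]})
cleaned = demo.dropna(axis=1, thresh=len(demo) // 2)
print(list(cleaned.columns))  # ['keep']
```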
ta_df.drop(columns='buildUpPlayDribbling', inplace=True)
ta_df.head()
# Check for duplicates
ta_df.duplicated().sum()
I am fortunate: there are no real major data issues. The biggest challenge is that I lose half the data due to needing to compare attributes to matches. I now have only 4 seasons of data.
- We will be starting with Question 1, which requires us to come up with a relative rank for each team, comparing the 2011/2012 season rank to the 2014/2015 rank and seeing which teams have the largest relative movement.
- From there we will answer Question 2 (correlations to rank), and finally Question 3, which team has the best home record in terms of wins.
- This will be an iterative process as we see what comes out from each step.
To answer this I created a team record for each season (wins, draws, and losses). Then I ranked the team in 2011/2012 and then in 2014/2015. This is the time period that I have overlapping data from the Team_Attributes table. Then I produced a rank difference.
# All complete season points
df = matches_df.copy()
df['points'] = df[['home_team_goal', 'away_team_goal']].apply(
    lambda row: 2 if row[0] > row[1] else (1 if row[0] == row[1] else 0), axis=1)  # home result: win = 2, draw = 1, loss = 0
df = df.sort_values(['season', 'points'], ascending=[True, False])
df = df.groupby(['season', 'home_team']).sum().reset_index()
df = df[['season', 'home_team', 'points']]
df.head(2)
# Create ranks for each season.
seasons = df.season.unique().tolist()
seasons
# Only want overlapping data with the Team_Attributes table
seasons = seasons[3:7]
seasons
# Input is matches_df_gb and seasons.
rank_year = []
for season in seasons:
    year = df.loc[df['season'] == season]
    year.insert(3, 'rank', range(1, 1 + len(year)))
    rank_year.append(year)
# Relative Rank between 2011/2012 season and 2014/2015 season.
# rename 2011/2012 columns
mdf_1112 = rank_year[0].copy()
mdf_1112.rename(columns=lambda x: x[:15] + "_1112", inplace=True)
mdf_1415 = rank_year[3].copy()
mdf_1415.rename(columns=lambda x: x[:15] + "_1415", inplace=True)
df_combined = mdf_1112.merge(mdf_1415, left_on='home_team_1112', right_on='home_team_1415', how='inner')
slim_df = df_combined[['home_team_1112', 'rank_1112', 'rank_1415']]
sorted_slim_df = slim_df.sort_values('home_team_1112')
sorted_slim_df['rank_difference'] = sorted_slim_df['rank_1112'] - sorted_slim_df['rank_1415']
sorted_slim_df = sorted_slim_df.sort_values('rank_difference', ascending=False).head(10)
sorted_slim_df.columns = ['team_name', '2011_2012_rank', '2014_2015_rank', 'rank_difference']
sorted_slim_df.head(10)
This simple table answers the question perfectly. No need to have a visualization.
Our definition of "leading to the highest rank" covers the seasons 2011-2015 (4 seasons), as detailed earlier. The dependent variable is the rank, in this case the quartile rank. I am looking for those attributes (features) that appear to correlate with that higher rank. There are 20 features in this table that we are using. Understanding which ones are important requires looking at a lot of plots in an automated fashion. Not only are there many features, there are also 4 seasons for each feature, AND 12 of the features are text; in order to evaluate them I need to convert those text fields to numeric values.
ta_df.head()
I want to see if there are any obvious correlations between rank and features using scatter plots.
ta_df.head(3)
def calc_season(date_value):
    # Calculate the season for the row; dates outside the four
    # overlapping seasons map to None and are dropped later.
    date = date_value.split()
    ymd = int(date[0].replace('-', ''))
    if ymd < 20110701 or ymd >= 20150701:
        season = None
    elif ymd < 20120701:
        season = '2011/2012'
    elif ymd < 20130701:
        season = '2012/2013'
    elif ymd < 20140701:
        season = '2013/2014'
    else:
        season = '2014/2015'
    return season
rsl = []
frsl = []
for index, feature in enumerate(features):
    f_df = ta_df[['date', 'team_long_name', feature]]
    fc_df = f_df.copy()
    fc_df['season'] = fc_df['date'].apply(calc_season)  # Create from dates
    fc_df = fc_df[['season', 'date', 'team_long_name', feature]]  # Reorder columns
    fc_df.dropna(how='any', inplace=True)  # Not all seasons in data.
    fc_df = fc_df.groupby(['season', 'date', 'team_long_name']).max().reset_index()
    for year, season in enumerate(seasons):
        rank_slim = rank_year[year].filter(['season', 'home_team', 'rank'], axis=1)
        f_rank_season = rank_slim.merge(fc_df, left_on='home_team', right_on='team_long_name', how='inner')
        f_rank_season = f_rank_season[['season_x', 'team_long_name', feature, 'rank']]
        f_rank_season.rename({'season_x': 'season'}, axis='columns', inplace=True)
        f_rank_season = f_rank_season.drop(f_rank_season[f_rank_season[feature] > 100].index)  # Outlier bad data.
        f_rank_season = f_rank_season.groupby(['season', 'team_long_name']).max().reset_index()
        rsl.append(rank_slim)
        frsl.append(f_rank_season)
# Generate a column list for the plots. There are 32 potential plots and only 8 feature names.
features_list = []
for df in frsl:
    column = list(df)
    features_list.append(column[2])
Scatter plots for all 8 features X 4 seasons are available, and they open in separate windows. Running more than 20 at once very quickly crushes Jupyter Notebook; please avoid this. To make it "easy" to manage, the loop below prints several plots at a time. Just edit AND RUN the cell below where plot_list is defined.
plot_list = range(0, 4) # Will print buildUpPlaySpeed for each season. Edit this list
for value in plot_list:
    frsl[value].plot(x=features_list[value], y='rank', figsize=(8, 5),
                     title=frsl[value].season.iloc[0] + ' Scatter Plot', kind='scatter')
frsl[0].head()
- There is often linear banding at certain places on the x axis. This makes me wonder how valid the data is; that is, how rigorous the data collection was and whether or not it is prone to human error or bias. The distribution of the data does not follow a normal pattern.
- There appear to be no features associated with higher rank, based on a review of the scatter plots.
I am now looking at the scatter plots for the text features. These are the features that have the word 'Class' in their feature name. They have text instead of numerics. I convert those to a simple integer value and then do a scatter plot on them.
# Deal with text classes.
features_class = ['buildUpPlaySpeedClass',
'buildUpPlayDribblingClass',
'buildUpPlayPassingClass',
'buildUpPlayPositioningClass',
'chanceCreationPassingClass',
'chanceCreationCrossingClass',
'chanceCreationShootingClass',
'chanceCreationPositioningClass',
'defencePressureClass',
'defenceAggressionClass',
'defenceTeamWidthClass',
'defenceDefenderLineClass']
# Needed to calculate dataframe column for classes
def calc_numeric(feature, unique_list):
    value_list = list(range(1, len(unique_list) + 1))
    for index, item in enumerate(unique_list):
        if item == feature:
            value = value_list[index]
            return value
        else:
            pass
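As a sanity check of this text-to-integer mapping, here is a self-contained copy of the logic on illustrative class labels (not necessarily the exact strings stored in Team_Attributes). One caveat worth noting: the integers follow first-appearance order in the data, so they are not guaranteed to be ordinal (e.g. Slow < Balanced < Fast).

```python
# Self-contained copy of the mapping logic: each unique class label
# is assigned an integer 1..N in the order it was first seen.
def to_numeric(label, unique_list):
    for index, item in enumerate(unique_list):
        if item == label:
            return index + 1

unique_list = ['Slow', 'Balanced', 'Fast']  # illustrative labels
print([to_numeric(v, unique_list) for v in ['Fast', 'Slow', 'Balanced']])
# [3, 1, 2]
```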
# All seasons for class text columns. For loops that create numeric field and scatter plot
rsl_class = []
frsl_class = []
features_class_list = (features_class) * 4
plot_ready_class_list = []
for index, feature in enumerate(features_class):
    f_class_df = ta_df[['date', 'team_long_name', feature]]
    fc_class_df = f_class_df.copy()
    fc_class_df['season'] = fc_class_df['date'].apply(calc_season)  # Create from dates
    fc_class_df = fc_class_df[['season', 'date', 'team_long_name', feature]]  # Reorder columns
    fc_class_df.dropna(how='any', inplace=True)  # Not all seasons in data.
    unique_list = fc_class_df[feature].unique().tolist()
    fc_class_df[feature + '_num'] = fc_class_df[feature].apply(calc_numeric, args=[unique_list])  # Add num column
    for year, season in enumerate(seasons):
        rank_class_slim = rank_year[year].filter(['season', 'home_team', 'rank'], axis=1)
        f_rank_class_season = rank_class_slim.merge(fc_class_df, left_on='home_team', right_on='team_long_name', how='inner')
        f_rank_class_season = f_rank_class_season.filter(['season_x', 'team_long_name', feature, feature + '_num', 'rank'], axis=1)
        f_rank_class_season.rename({'season_x': 'season'}, axis='columns', inplace=True)
        # Group the results by season and team_long_name and use max.
        # Would need to calculate a weighted average of the values as a percentage of season to use mean.
        # Not clear when the data is collected and when it is entered. max is fine to get trends.
        f_rank_class_season = f_rank_class_season.groupby(['season', 'team_long_name']).max().reset_index()
        rsl_class.append(rank_class_slim)
        frsl_class.append(f_rank_class_season)
# Generate a column list for the scatter plots.
features_class_num_list = []
for df in frsl_class:
    columns = list(df)
    features_class_num_list.append(columns[3])
Scatter plots for all 12 features X 4 seasons are available. Running more than 20 at once quickly crushes Jupyter Notebook; please avoid this. To make it "easy" to manage, the loop below prints several plots at a time. Just edit AND RUN the cell below where plot_list is defined.
print(features_class_num_list)
plot_list = range(8, 12)  # Prints one class feature's four seasons. Edit this list.
for value in plot_list:
    plot_title = frsl_class[value].season.iloc[0] + " Scatter Plot"
    sns.stripplot(x=features_class_num_list[value], y='rank', size=3, data=frsl_class[value],
                  jitter=True).set_title(plot_title)
    plt.show()
The scatter plots were not helpful. Especially the class features converted from text to numeric. Those class features banded vertically along the x axis. Very difficult to derive any information from that. It is not possible to tease information out of the class features by taking a mean over the 4 years. This is because the teams are not static. Players, coaches, etc. move in and out season to season. They are relatively stable within the season, but many changes are made between seasons. A mean over 4 years would be fundamentally misleading. Let's look at bar charts as a way of gaining information from this dataset.
# Rank is taken every season and used to create a quantile bar chart for each feature.
year_rank_quart_dict = {}
for value in range(4):
    rank_list = frsl[value]['rank'].quantile([.25, .5, .75, 1]).tolist()
    dict_update = {frsl[value].season.iloc[0]: rank_list}
    year_rank_quart_dict.update(dict_update)
print(year_rank_quart_dict)
# Calculate the quartile rank for each season.
def calc_quartile(rank, season_key):
    quartile_list = year_rank_quart_dict[season_key]
    if rank <= quartile_list[0]:
        quartile = 1
    elif rank <= quartile_list[1]:
        quartile = 2
    elif rank <= quartile_list[2]:
        quartile = 3
    else:
        quartile = 4
    return quartile
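For reference, pandas' qcut performs equivalent bucketing in one call. A minimal sketch on 20 illustrative ranks, which should agree with calc_quartile when the ranks are untied:

```python
import pandas as pd

# 20 teams, rank 1 is best; qcut splits the ranks into 4 equal-sized
# quartile buckets labeled 1 (top) through 4 (bottom).
ranks = pd.Series(range(1, 21))
quartile = pd.qcut(ranks, 4, labels=[1, 2, 3, 4])
print(quartile.value_counts().sort_index().tolist())  # [5, 5, 5, 5]
```

The hand-rolled calc_quartile is kept in the notebook because the quartile cut points are computed once per season and reused across many dataframes.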
# Create the dataframe for a simple bar chart.
bar_list = []
for index, df in enumerate(frsl):
    season_key = df.season.iloc[0]
    df['quartile'] = df['rank'].apply(calc_quartile, args=[season_key])
    df = df[[features_list[index], 'quartile']]
    df = df.groupby('quartile').mean().reset_index()
    bar_list.append(df)
# Calculate title for bar chart titles
def calc_title(value, feature):
    yvl_1112 = range(0, 48, 4)
    yvl_1213 = range(1, 48, 4)
    yvl_1314 = range(2, 48, 4)
    if value in yvl_1112:
        return seasons[0]
    elif value in yvl_1213:
        return seasons[1]
    elif value in yvl_1314:
        return seasons[2]
    else:
        return seasons[3]
plot_list = range(0, 4)  # Prints buildUpPlaySpeed for each season as a bar chart. Edit list.
for value in plot_list:
    bar_list[value].plot(x='quartile', label=features_list[value], figsize=(8, 5),
                         title=calc_title(value, features_list[value]) + ' Simple Bar Chart', kind='bar')
Again, these bar charts were not helpful. The differences are very small between the rank quartiles and each feature.
Continuing on with the simple bar charts for the class features.
# Create the dataframe for a simple bar chart.
bar_class_list = []
for index, df in enumerate(frsl_class):
    season_key = df.season.iloc[0]
    df['quartile'] = df['rank'].apply(calc_quartile, args=[season_key])
    df = df.groupby('quartile').mean().reset_index()
    df = df[[features_class_num_list[index], 'quartile']]
    bar_class_list.append(df)
plot_list = range(8, 12)  # Prints one class feature's _num column for each season as a bar chart
for value in plot_list:
    bar_class_list[value].plot(x='quartile', label=features_class_num_list[value], figsize=(8, 5),
                               title=calc_title(value, features_class_num_list[value]) + ' Simple Bar Chart', kind='bar')
The review of these bar charts often shows a very consistent pattern between quartile 1 and quartile 4 teams per feature. However, the differences are usually very small. To illuminate this better I am going to create Numeric Share of Difference bar charts, as below.
The above bar charts did not provide substantial additional information. There are only small differences between personnel and teams at the professional level of soccer. A Numeric Share of Difference Bar Chart should be more instructive.
- Create a bar chart that calculates the difference between the mean of the quartiles. That is of the percentage difference how much is "owned" by each quartile. Plots will open in a separate window.
- The purpose of this bar chart is to highlight the differences between the top quartile and the other quartiles. The patterns that I am looking for are:
- A cascade from the top quartile down to the lowest quartile (or vice versa) and significant differences between them.
- A pattern of the top quartile owning virtually all difference between all of the quartiles, especially as compared to quartile 2.
- I would also like to see that pattern repeated (generally) over all 4 years. If I see these patterns, then there is possibly a correlation between this feature and being top quartile ranked.
- These are the most instructive plots that I have been able to derive.
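The transform described above can be sketched as a small function; the tiny constants mirror the guard used later in the notebook against division by zero when all four quartile means are equal:

```python
# Rescale each quartile's mean to its share (in percent) of the
# max-min spread across the four quartiles.
def share_of_difference(raw_heights):
    min_value, max_value = min(raw_heights), max(raw_heights)
    return [(h - min_value + .0001) / (max_value - min_value + .01) * 100
            for h in raw_heights]

# Illustrative quartile means: quartile 4 owns nearly all of the spread.
print([round(s, 1) for s in share_of_difference([50.0, 52.0, 51.0, 60.0])])
# [0.0, 20.0, 10.0, 99.9]
```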
# Deal with features_list.
print(features_list)
fnsoddfl = []  # Feature Numeric Share Of Difference DataFrame List
for value in range(len(features_list)):
    heights = []
    df = bar_list[value].copy()
    raw_heights = df[features_list[value]].tolist()
    min_value = min(raw_heights)
    max_value = max(raw_heights)
    for height in raw_heights:
        result = (height - min_value + .0001) / (max_value - min_value + .01) * 100
        heights.append(result)
    df[features_list[value]] = heights
    fnsoddfl.append(df)
# Change the plot list to see a different feature's bar chart.
plot_list = range(0, 4)  # Will print buildUpPlaySpeed for each season as a bar chart
for value in plot_list:
    title_suffix = ' Percentage Mean Difference With Rank Quartile'
    locations = [1, 2, 3, 4]
    labels = ('1', '2', '3', '4')
    heights = fnsoddfl[value][features_list[value]]
    plt.subplots(figsize=(8, 5))
    plt.bar(locations, heights, tick_label=labels)
    plt.title(calc_title(value, features_list[value]) + title_suffix)
    plt.xlabel('Rank Quartile')
    plt.ylabel(features_list[value]);
The features that showed the patterns that we were looking for are buildUpPlayPassing, chanceCreationShooting, defenceAggression, and defencePressure.
fcnsoddfl = []  # Feature Class Num Share Of Difference DataFrame List
for value in range(len(features_class_num_list)):
    heights = []
    df = bar_class_list[value].copy()
    raw_heights = df[features_class_num_list[value]].tolist()
    min_value = min(raw_heights)
    max_value = max(raw_heights)
    for height in raw_heights:
        result = (height - min_value + .0001) / (max_value - min_value + .01) * 100
        heights.append(result)
    df[features_class_num_list[value]] = heights
    fcnsoddfl.append(df)
plot_list = range(0, 4) # Will print buildUpPlaySpeedClass_num for each season as a bar chart
for value in plot_list:
    title_suffix = ' Percentage Mean Difference With Rank Quartile'
    locations = [1, 2, 3, 4]
    labels = ('1', '2', '3', '4')
    heights = fcnsoddfl[value][features_class_num_list[value]]
    plt.subplots(figsize=(8, 5))
    plt.bar(locations, heights, tick_label=labels)
    plt.title(calc_title(value, features_class_num_list[value]) + title_suffix)
    plt.xlabel('Rank Quartile')
    plt.ylabel(features_class_num_list[value]);
The features that showed the patterns that we were looking for consistently across all 4 of the seasons were buildUpPlayPassingClass_num and chanceCreationPassingClass_num.
Now I need to understand the correlations, if any, between the most significant variables. The variables that I have chosen based on the preceding percentage of difference bar charts are in the Significant Features List (sfl) below.
sfl = ['buildUpPlayPassing', 'defenceAggression', 'defencePressure', 'chanceCreationShooting',
'buildUpPlayPositioningClass_num', 'chanceCreationPositioningClass_num']
all_features_by_year = (features_list + features_class_num_list)
fid = collections.OrderedDict()
fid = dict(zip(all_features_by_year, range(1, len(all_features_by_year)+1)))
df_years_list = []
sfl_dfs = frsl + frsl_class
df_agg_list = []
# Take all 4 seasons of data, trim it, and put it into one list per significant feature.
for value in sfl:
    fid_value = value
    for value in range(fid[fid_value] - 4, fid[fid_value]):
        df = sfl_dfs[value][['team_long_name', fid_value, 'quartile']]
        df = df[df['quartile'] == 1]
        df_years_list.append(df)
# Append the 4 seasons' dataframes together.
for i in range(len(sfl)):
    fyl_index = range(0, len(sfl * 4), 4)
    offset = fyl_index[i]
    df = df_years_list[offset].append(df_years_list[offset + 1], ignore_index=True)
    df1 = df.append(df_years_list[offset + 2], ignore_index=True)
    df2 = df1.append(df_years_list[offset + 3], ignore_index=True)
    df2 = df2.drop(['quartile'], axis=1)
    df_agg_list.append(df2)
col_list = range(len(sfl))
row_list = range(len(sfl))
# Merged pair list
mpl = []
pair = (0, 0)
for col in col_list:
    for row in row_list:
        if col == row:
            pass
        elif (row, col) in mpl or (col, row) in mpl:
            pass
        else:
            pair = (col, row)
            mpl.append(pair)
# Merge the dataframes into pairs for scatter plot purposes to test correlation.
# NB IF you run this once and you want to run it again, you need to go back 4 cells and start
# there otherwise it will fail.
merged_features = []
for item in mpl:
    left_side, right_side = item
    df = df_agg_list[left_side].merge(df_agg_list[right_side], left_index=True, right_index=True, how='left')
    df.drop(['team_long_name_y'], axis=1, inplace=True)
    df.rename(columns={'team_long_name_x': 'team_long_name'}, inplace=True)
    merged_features.append(df)
    plot_title = 'Top Rank Quartile Correlation For 2011-2015 Seasons'
    df.plot(x=sfl[left_side], y=sfl[right_side],
            figsize=(8, 5), title=plot_title, kind='scatter')
for index, df in enumerate(merged_features):
    if index == 5:
        df.plot(x='defenceAggression', y='defencePressure', figsize=(8, 5),
                title='Top Rank Quartile Correlation For 2011-2015 Seasons', kind='scatter')
    elif index == 6:
        df.plot(x='chanceCreationShooting', y='defenceAggression', figsize=(8, 5),
                title='Top Rank Quartile Correlation For 2011-2015 Seasons', kind='scatter')
    elif index == 9:
        df.plot(x='chanceCreationShooting', y='defencePressure', figsize=(8, 5),
                title='Top Rank Quartile Correlation For 2011-2015 Seasons', kind='scatter')
    else:
        pass
Run a Spearman correlation on these features to see if they are correlated.
merged_features[5].drop(['team_long_name'], axis=1).corr(method='spearman')
merged_features[6].drop(['team_long_name'], axis=1).corr(method='spearman')
merged_features[9].drop(['team_long_name'], axis=1).corr(method='spearman')
The above pairs do exhibit some level of correlation. Especially the pair defencePressure and chanceCreationShooting.
I can also look at the correlations between ALL features and rank by utilizing Spearman and Pearson correlations. To feed these algorithms I need to merge the Team_Attributes table with the team rank. Since I want to answer the question of how the attributes affect rank, I will take the rank EACH season for each team and combine that with the attributes. I will also create the rank quartile so that the results can be compared with the above scatter and bar chart approach.
df = ta_df.copy()
df['season'] = df['date'].apply(calc_season) # Create from dates
df.dropna(how='any', inplace=True) # Not all seasons in data.
df = df.groupby(['season', 'date', 'team_long_name']).max().reset_index()
for feature in features_class:
    unique_list = df[feature].unique().tolist()
    df[feature + '_num'] = df[feature].apply(calc_numeric, args=[unique_list])  # Add num column
df = df.merge(slim_df, left_on='team_long_name', right_on='home_team_1112', how='inner')
season_key = df.season.iloc[0]
df['quartile'] = df['rank_1415'].apply(calc_quartile, args=[season_key])
df = df[['team_long_name', 'quartile', 'buildUpPlaySpeed',
'chanceCreationCrossing', 'chanceCreationShooting', 'defencePressure',
'defenceAggression', 'defenceTeamWidth', 'buildUpPlayPassingClass_num',
'buildUpPlayPositioningClass_num', 'chanceCreationPassingClass_num',
'chanceCreationCrossingClass_num', 'chanceCreationShootingClass_num',
'chanceCreationPositioningClass_num', 'defenceTeamWidthClass_num']]
df = df.groupby('team_long_name').mean().reset_index()
# Please run this cell. For some reason it does not always run.
df.corr(method='spearman')
# Please run this cell. For some reason it does not always run.
df.corr(method='pearson').style.format("{:.2}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)
The dependent variable is quartile rank. A coefficient close to 1 or -1 indicates a strong correlation. This data was not correlated, as shown by the Spearman and Pearson coefficients above.
- In the My Analysis column there are no correlation coefficients; I was simply looking at scatter plots and bar charts. The features shown in the Spearman and Pearson columns are the MOST correlated features. As you will notice, they do not even match the My Analysis choices, and even their correlations are VERY low. This does not surprise me: when I looked at the initial scatter plots, there were no obvious correlations.
- I produced over 140 scatter and bar charts in the above analysis. The features (attributes) that appeared to have the largest correlation to quartile ranking were as above (My Analysis, first column in the above table). I checked that result using Spearman and Pearson statistical analysis. The net result is that while the 140-plus plots hint at correlations and possible relationships, the Spearman and Pearson analysis makes that quite unlikely. I would draw no conclusions from this data/analysis.
Since there is no requirement to see if team attributes affect win percentage of the team at home, I simply took the entire dataframe and added the appropriate columns and calculations.
def calc_win(values):
    # Values is being passed as a list.
    home_team_goal, away_team_goal = values
    if home_team_goal > away_team_goal:
        result = 1
    else:
        result = 0
    return result

def calc_draw(values):
    # Values is being passed as a list.
    home_team_goal, away_team_goal = values
    if home_team_goal == away_team_goal:
        result = 1
    else:
        result = 0
    return result

def calc_loss(values):
    # Values is being passed as a list.
    home_team_goal, away_team_goal = values
    if home_team_goal < away_team_goal:
        result = 1
    else:
        result = 0
    return result
df = matches_df.copy()
df = df[['home_team', 'home_team_goal', 'away_team_goal']]
df['wins'] = df[['home_team_goal', 'away_team_goal']].apply(calc_win, axis=1)
df['draws'] = df[['home_team_goal', 'away_team_goal']].apply(calc_draw, axis=1)
df['losses'] = df[['home_team_goal', 'away_team_goal']].apply(calc_loss, axis=1)
df = df.groupby('home_team').sum()
df['winning%'] = df[['wins', 'draws', 'losses']].apply(lambda x: (x.wins/(x.wins + x.draws + x.losses)) * 100, axis=1)
df['draw%'] = df[['wins', 'draws', 'losses']].apply(lambda x: (x.draws/(x.wins + x.draws + x.losses)) * 100, axis=1)
df['loss%'] = df[['wins', 'draws', 'losses']].apply(lambda x: (x.losses/(x.wins + x.draws + x.losses)) * 100, axis=1)
df = df.sort_values('winning%', ascending=False)
win_home_team_df = df[['wins', 'draws', 'losses', 'winning%', 'draw%', 'loss%']]
win_home_team_df.head(10)
This simple table answers the question perfectly. No need to have a visualization.
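For reference, the three row-wise calc_* helpers above could be collapsed into a single vectorized pass. A minimal sketch on illustrative scores:

```python
import numpy as np
import pandas as pd

# np.sign reduces each match to +1 (home win), 0 (draw), or -1
# (home loss) without a per-row apply, which is much faster on
# ~25,000 rows of match data.
sample = pd.DataFrame({'home_team_goal': [3, 1, 0, 2],
                       'away_team_goal': [1, 1, 2, 0]})
outcome = np.sign(sample['home_team_goal'] - sample['away_team_goal'])
wins, draws, losses = (int((outcome == s).sum()) for s in (1, 0, -1))
print(wins, draws, losses)  # 2 1 1
```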
Limitations
- Study Design The dataset is impressive and holds a lot of information that is very useful for evaluating betting strategies. However, the original data that the Kaggle data was compiled from had a Player_Attributes table with extensive features on who played for what team in which matches. The margin of victory is very narrow in soccer; witness the draw% numbers above, where only one goal more or less changes a result from a draw to a win or loss. A top striker would likely have a huge impact on the rank of a team. It is likely that combining the Player_Attributes data with the Match and Team_Attributes tables would yield the most effective answer to Q2, "What team attributes lead to the highest rank?" With that additional data the question would be posed as: what team AND player attributes lead to the highest rank?
- Lack of available and/or reliable data While the data is hoped to be reliable, there were anomalies in it that clearly pointed to a less-than-rigorous method of collecting and entering the data. On a number of features (e.g. buildUpPlaySpeed) there was a large cluster of numeric results at 70 on the x axis. This is not a normal distribution. More background on the data collection and the veracity of the data would be helpful.
- Lack of prior research studies on the topic The design of the study was to demonstrate abilities with using pandas, numpy, and minor statistical techniques. It was not to garner information in the field and produce a scholarly report.
Summary
- I was able to answer Q1 and Q3 quite effectively. That is who has improved the most and who has the best home record.
- However, for Q2, despite providing automation tools to look at scatter plots, bar charts (two types), and over 140 plots in total, there was no conclusive evidence that any of these features are correlated with the quartile rank of the teams in the European Soccer Database.