Contact Information

  • Lindsay Moir
  • 778 679 4406
  • tragoes@gmail.com

Introduction

I am using the European soccer database available from Kaggle. The questions that I am asking are as follows.

  1. What teams improved the most?
  2. What team attributes lead to the highest rank?
  3. Who are the best home teams in terms of % wins at home?

System Environment

Methodology

The Match data ranges from 2008 to 2016 while the Team Attribute data ONLY goes from 2011 to 2015. To come up with a rank I need to do it by season. As a result, I will only be reviewing 4 seasons worth of data as follows.

image.png

There are potentially 21 independent variables to be ranked against the one dependent variable which is rank improvement. The methodology is to:

  • Compare each Team_Attribute against the ranks of the teams in THAT SEASON. Teams change radically year to year. It makes no sense to take a mean over a number of years and compare it to the final rank.
  • I will use scatter plots to initially see if there are any correlations. If there are strong correlations for several features I will choose that subset of features to look at bar charts of those features against quartile rank. If there are few correlations revealed in the scatter plots I will run all of the features with bar charts.
  • If the bar charts are sufficiently revealing I will choose 4 features (attributes) and run correlations between those pairs of attributes to see if they are related.
  • If those bar charts are not helpful, I will run bar charts that are based on the difference between the max and min for each quartile and create a bar chart that shows which quartile owns what percentage of that difference.
  • Finally, to ensure that I have revealed statistically valid relationships I will run Spearman and Pearson algorithms against the 4 seasons to see if they agree with my results.
  • I will need automation to do the above. There are 20 features with 4 seasons each. There are 4 different sets of charts (scatter, bar, bar as a share of difference, and covariance relationship between features). Altogether there are roughly 140 charts that you can generate with this notebook.

Attribution

  • I made extensive use of stackoverflow (https://stackoverflow.com). However, there is no carte blanche copying of code in this notebook.
  • For the Pearson and Spearman statistical analysis I used this source. https://www.datascience.com/learn-data-science/fundamentals/introduction-to-correlation-python-data-science Pearson measures the linear relationship between continuous variables. Spearman's rank correlation coefficient can be defined as a special case of Pearson ρ applied to ranked (sorted) variables. According to Introduction to Correlation by Ruslana Dalinina 01.31.17 'Unlike Pearson, Spearman's correlation is not restricted to linear relationships. It measures monotonic association (only strictly increasing or decreasing, but not mixed) between two variables and relies on the rank order of values. In other words, rather than comparing means and variances, Spearman's coefficient looks at the relative order of values for each variable. This makes it appropriate to use with both continuous and discrete data.'
  • I used code from this URL on Kaggle https://www.kaggle.com/dimarudov/data-analysis-using-sql to help me understand how to access the SQLite database.

Instructions For Running This Jupyter Notebook

  • I recommend that you do not enter the notebook and use "Run All" cells. There are a large number of plots to be ran. This will crush the kernel and cause subsequent cells to not run. It is best to step thru the notebook and run each cell in order. Otherwise some cells will fail and it will be confusing.
  • While Jupyter Notebooks are an incredibly productive system for iterating on code, they tend to encourage less than great programming habits. Specifically name space issues. If you find that some of the code is failing, in most cases, you will only need to go to the top of the section that you are in and simply run that section again. If that does not work, look at the preceding cells until you find where match_df or team_da are referenced. Then run the notebook from there.
  • I have also provided a large number of plots for the reader to review. I have provided a simple range function at the front of most of these plots. Just edit the numbers to see the plots. Usually they will be in groups of 4. So, range(4, 8) will get you the second feature and the years from 2011-2015.
In [1]:
import numpy as np
import pandas as pd
import sqlite3
import collections
import matplotlib.pyplot as plt
%matplotlib qt
import seaborn as sns
sns.set_style('darkgrid')

Data Wrangling

  • I did the usual process of reviewing the data, looking for NaNs, deduping, running histograms, etc. Since I invested so much time automating plot creation I became quite familiar with the data. It is a very clean dataset.
  • There was a huge amount of work that went into understanding this data. In order to make this notebook understandable much of that code is not provided here. I can provide the earlier iterations of the notebook for you to peruse if you wish.

General Properties

I am only looking at the Matches and Team_Attributes table. The sql database does not include who is playing for what team. So, the Player_Attributes table is impossible to link to team outcomes. The Matches table has data from July 18, 2008 and ends in May 25, 2016 The Team attributes data does not start until February 22, 2010 and only goes until Sep 10, 2015. If I want full seasons to match our data that means I am restricted to July 1, 2011 to June 30, 2015. This is only 4 years. That cuts our data roughly in half from ~ 25,000 rows to ~ 14,000 rows for Matches (for example).

In [2]:
# Connect to the database

database = 'database.sqlite'
conn = sqlite3.connect(database)
tables = pd.read_sql("""SELECT *
                        FROM sqlite_master
                        WHERE type='table';""", conn)
tables
Out[2]:
type name tbl_name rootpage sql
0 table sqlite_sequence sqlite_sequence 4 CREATE TABLE sqlite_sequence(name,seq)
1 table Player_Attributes Player_Attributes 11 CREATE TABLE "Player_Attributes" (\n\t`id`\tIN...
2 table Player Player 14 CREATE TABLE `Player` (\n\t`id`\tINTEGER PRIMA...
3 table Match Match 18 CREATE TABLE `Match` (\n\t`id`\tINTEGER PRIMAR...
4 table League League 24 CREATE TABLE `League` (\n\t`id`\tINTEGER PRIMA...
5 table Country Country 26 CREATE TABLE `Country` (\n\t`id`\tINTEGER PRIM...
6 table Team Team 29 CREATE TABLE "Team" (\n\t`id`\tINTEGER PRIMARY...
7 table Team_Attributes Team_Attributes 2 CREATE TABLE `Team_Attributes` (\n\t`id`\tINTE...
In [3]:
# Base table for understanding answer to Q1. How have the teams progressed/digressed over the years.

matches_df = pd.read_sql("""SELECT Match.id, 
                                Country.name AS country_name, 
                                League.name AS league_name, 
                                season, 
                                stage, 
                                date,
                                HT.team_long_name AS  home_team,
                                AT.team_long_name AS away_team,
                                home_team_goal, 
                                away_team_goal                                        
                            FROM Match
                            JOIN Country on Country.id = Match.country_id
                            JOIN League on League.id = Match.league_id
                            LEFT JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
                            LEFT JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
                            ORDER by date; 
                            """, conn)

matches_df.head()
Out[3]:
id country_name league_name season stage date home_team away_team home_team_goal away_team_goal
0 24559 Switzerland Switzerland Super League 2008/2009 1 2008-07-18 00:00:00 BSC Young Boys FC Basel 1 2
1 24560 Switzerland Switzerland Super League 2008/2009 1 2008-07-19 00:00:00 FC Aarau FC Sion 3 1
2 24561 Switzerland Switzerland Super League 2008/2009 1 2008-07-20 00:00:00 FC Luzern FC Vaduz 1 2
3 24562 Switzerland Switzerland Super League 2008/2009 1 2008-07-20 00:00:00 Neuchâtel Xamax FC Zürich 1 2
4 24613 Switzerland Switzerland Super League 2008/2009 2 2008-07-23 00:00:00 FC Basel Grasshopper Club Zürich 1 0

Data Cleaning

I went thru the typical process of looking for nulls (NaNs), duplicates, values out of range, counting unique elements and looking at them etc. The data was quite clean. I only dropped one column in the Team Attributes table due to NaNs. The biggest challenge is that there are 21 features in this dataset and it is quite time consuming to understand which features are important and which are not. I built automated processes to do that.

In [4]:
matches_df.shape
Out[4]:
(25979, 10)
In [5]:
matches_df.describe()
Out[5]:
id stage home_team_goal away_team_goal
count 25979.000000 25979.000000 25979.000000 25979.000000
mean 12990.000000 18.242773 1.544594 1.160938
std 7499.635658 10.407354 1.297158 1.142110
min 1.000000 1.000000 0.000000 0.000000
25% 6495.500000 9.000000 1.000000 0.000000
50% 12990.000000 18.000000 1.000000 1.000000
75% 19484.500000 27.000000 2.000000 2.000000
max 25979.000000 38.000000 10.000000 9.000000
In [6]:
# Histograms for home_team_goal and away_team_goal
plt.figure(figsize=(8, 5))
plt.xlabel('Goals')
plt.ylabel('Matches')
plt.hist(matches_df['home_team_goal'], alpha=.5, label='home_team_goal')
plt.hist(matches_df['away_team_goal'], alpha=.5, label='away_team_goal')
plt.title('Histogram of Match Table Numeric Features')
plt.legend();

As expected home teams score more goals than away teams!

In [7]:
matches_df.isnull().sum()
Out[7]:
id                0
country_name      0
league_name       0
season            0
stage             0
date              0
home_team         0
away_team         0
home_team_goal    0
away_team_goal    0
dtype: int64

The above is great! Not a single null (NaN)!

In [8]:
# Check for duplicates
matches_df.duplicated().sum()
Out[8]:
0
In [9]:
# Check for bad values on season. All are good.
matches_df['season'].unique()
Out[9]:
array(['2008/2009', '2009/2010', '2010/2011', '2011/2012', '2012/2013',
       '2013/2014', '2014/2015', '2015/2016'], dtype=object)
In [10]:
matches_df['stage'].unique()
Out[10]:
array([ 1,  2,  3,  4,  6,  5,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38], dtype=int64)

The Match table columns that I am using are very clean. No nulls (NaNs), duplicates, or funny values. The only issue is the lack of matching data in the Team_Attributes table.

In [11]:
ta_df = pd.read_sql("""SELECT Team.team_long_name, Team_Attributes.* 
                                FROM Team_Attributes
                                JOIN Team
                                    ON Team.team_api_id = Team_Attributes.team_api_id 
                                ORDER BY Team.team_long_name
                                ;""", conn)
ta_df.head()
Out[11]:
team_long_name id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1. FC Kaiserslautern 631 29 8350 2010-02-22 00:00:00 45 Balanced NaN Little 55 ... 70 Lots Organised 70 High 70 Double 70 Wide Cover
1 1. FC Kaiserslautern 632 29 8350 2011-02-22 00:00:00 48 Balanced NaN Little 42 ... 53 Normal Organised 38 Medium 48 Press 53 Normal Cover
2 1. FC Kaiserslautern 633 29 8350 2012-02-22 00:00:00 38 Balanced NaN Little 54 ... 64 Normal Organised 37 Medium 51 Press 53 Normal Cover
3 1. FC Kaiserslautern 634 29 8350 2013-09-20 00:00:00 61 Balanced NaN Little 54 ... 56 Normal Organised 48 Medium 50 Press 53 Normal Cover
4 1. FC Kaiserslautern 635 29 8350 2014-09-19 00:00:00 66 Balanced 48.0 Normal 53 ... 51 Normal Organised 47 Medium 47 Press 49 Normal Cover

5 rows × 26 columns

In [12]:
ta_df.describe()
Out[12]:
id team_fifa_api_id team_api_id buildUpPlaySpeed buildUpPlayDribbling buildUpPlayPassing chanceCreationPassing chanceCreationCrossing chanceCreationShooting defencePressure defenceAggression defenceTeamWidth
count 1458.000000 1458.000000 1458.000000 1458.000000 489.000000 1458.000000 1458.000000 1458.000000 1458.000000 1458.000000 1458.000000 1458.000000
mean 729.500000 17706.982167 9995.727023 52.462277 48.607362 48.490398 52.165295 53.731824 53.969136 46.017147 49.251029 52.185871
std 421.032659 39179.857739 13264.869900 11.545869 9.678290 10.896101 10.360793 11.086796 10.327566 10.227225 9.738028 9.574712
min 1.000000 1.000000 1601.000000 20.000000 24.000000 20.000000 21.000000 20.000000 22.000000 23.000000 24.000000 29.000000
25% 365.250000 110.000000 8457.750000 45.000000 42.000000 40.000000 46.000000 47.000000 48.000000 39.000000 44.000000 47.000000
50% 729.500000 485.000000 8674.000000 52.000000 49.000000 50.000000 52.000000 53.000000 53.000000 45.000000 48.000000 52.000000
75% 1093.750000 1900.000000 9904.000000 62.000000 55.000000 55.000000 59.000000 62.000000 61.000000 51.000000 55.000000 58.000000
max 1458.000000 112513.000000 274581.000000 80.000000 77.000000 80.000000 80.000000 80.000000 80.000000 72.000000 72.000000 73.000000
In [13]:
# no text fields, numeric only
features = ['buildUpPlaySpeed',
            'buildUpPlayPassing',
            'chanceCreationPassing',
            'chanceCreationCrossing',
            'chanceCreationShooting',
            'defencePressure',
            'defenceAggression',
            'defenceTeamWidth'] 
In [14]:
# Histograms for Team_Attributes
for feature in features:
    plt.figure(figsize=(8, 5))
    plt.xlabel('Value')
    plt.ylabel('Counts')
    plt.hist(ta_df[feature], alpha=.5, label=feature)
    plt.title('Histogram of ' + feature + ' From Team Attributes Table')
    plt.legend();

There are very large concentrations of data points in the centres of most of the ta_df histograms. These histograms are directly derived from the Team Attributes table.

In [15]:
ta_df.isnull().sum()
Out[15]:
team_long_name                      0
id                                  0
team_fifa_api_id                    0
team_api_id                         0
date                                0
buildUpPlaySpeed                    0
buildUpPlaySpeedClass               0
buildUpPlayDribbling              969
buildUpPlayDribblingClass           0
buildUpPlayPassing                  0
buildUpPlayPassingClass             0
buildUpPlayPositioningClass         0
chanceCreationPassing               0
chanceCreationPassingClass          0
chanceCreationCrossing              0
chanceCreationCrossingClass         0
chanceCreationShooting              0
chanceCreationShootingClass         0
chanceCreationPositioningClass      0
defencePressure                     0
defencePressureClass                0
defenceAggression                   0
defenceAggressionClass              0
defenceTeamWidth                    0
defenceTeamWidthClass               0
defenceDefenderLineClass            0
dtype: int64

With buildUpPlayDribbling having 969 NaNs in a total of 1458 rows, it makes no sense to keep this column. This attribute is still captured by buildUpPlayDribblingClass.

In [16]:
ta_df.drop(columns='buildUpPlayDribbling', inplace=True)
ta_df.head()
Out[16]:
team_long_name id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1. FC Kaiserslautern 631 29 8350 2010-02-22 00:00:00 45 Balanced Little 55 Mixed ... 70 Lots Organised 70 High 70 Double 70 Wide Cover
1 1. FC Kaiserslautern 632 29 8350 2011-02-22 00:00:00 48 Balanced Little 42 Mixed ... 53 Normal Organised 38 Medium 48 Press 53 Normal Cover
2 1. FC Kaiserslautern 633 29 8350 2012-02-22 00:00:00 38 Balanced Little 54 Mixed ... 64 Normal Organised 37 Medium 51 Press 53 Normal Cover
3 1. FC Kaiserslautern 634 29 8350 2013-09-20 00:00:00 61 Balanced Little 54 Mixed ... 56 Normal Organised 48 Medium 50 Press 53 Normal Cover
4 1. FC Kaiserslautern 635 29 8350 2014-09-19 00:00:00 66 Balanced Normal 53 Mixed ... 51 Normal Organised 47 Medium 47 Press 49 Normal Cover

5 rows × 25 columns

In [17]:
# Check for duplicates
ta_df.duplicated().sum()
Out[17]:
0

I am fortunate, there are no real major data issues. The biggest challenge is that I lose half our data due to needing to compare attributes to matches. I now only have 4 seasons of data.

Exploratory Data Analysis

  • We will be starting with Qestion 1 which requires us to come up with relative rank for each team comparing the 2011/2012 season rank to the 2014/2015 rank and seeing which teams have the largest relative movement.
  • From there we will answer Question 2 (correlations to rank), and finally Question 3, which team has the best home record in terms of wins.
  • This will be an iterative process as we see what comes out from each step.

Q1 What teams improved the most?

To answer this I created a team record for each season (wins, draws, and losses). Then I ranked the team in 2011/2012 and then in 2014/2015. This is the time period that I have overlapping data from the Team_Attributes table. Then I produced a rank difference.

In [18]:
# All complete season points
df = matches_df.copy()
df['points'] = df[['home_team_goal', 'away_team_goal']].apply(lambda row: 2 if row[0] > row[1] else (1 if row[0] == row[1] else 0), axis=1)
df = df.sort_values(['season', 'points'], ascending=[True, False])
df = df.groupby(['season', 'home_team']).sum().reset_index()
df = df[['season', 'home_team', 'points']]
In [19]:
df.head(2)
Out[19]:
season home_team points
0 2008/2009 1. FC Köln 13
1 2008/2009 AC Bellinzona 20
In [20]:
# Create ranks for each season. 
seasons = df.season.unique().tolist()
seasons
Out[20]:
['2008/2009',
 '2009/2010',
 '2010/2011',
 '2011/2012',
 '2012/2013',
 '2013/2014',
 '2014/2015',
 '2015/2016']
In [21]:
# Only want overlapping data with Team_Attributres table
seasons = seasons[3:7]
seasons
Out[21]:
['2011/2012', '2012/2013', '2013/2014', '2014/2015']
In [22]:
# Input is matches_df_gb and seasons.
rank_year = []
for season in seasons:
        year = df.loc[df['season'] == season]
        year.insert(3, 'rank', range(1, 1 + len(year)))
        rank_year.append(year)
In [23]:
# Relative Rank between 2011/2012 season and 2014/2015 season.
# rename 2011/2012 columns
mdf_1112 = rank_year[0].copy()
mdf_1112.rename(columns=lambda x: x[:15] + "_1112", inplace=True)
mdf_1415 = rank_year[3].copy()
mdf_1415.rename(columns=lambda x: x[:15] + "_1415", inplace=True)
df_combined = mdf_1112.merge(mdf_1415, left_on='home_team_1112', right_on='home_team_1415', how='inner')
slim_df = df_combined[['home_team_1112', 'rank_1112', 'rank_1415']]
sorted_slim_df = slim_df.sort_values('home_team_1112')
sorted_slim_df['rank_difference'] = sorted_slim_df['rank_1112'] - sorted_slim_df['rank_1415']
sorted_slim_df = sorted_slim_df.sort_values('rank_difference', ascending=False).head(10)
sorted_slim_df.columns = ['team_name', '2011_2012_rank', '2014_2015_rank', 'rank_difference']

A1 Top 10 Most Improved Teams

In [24]:
sorted_slim_df.head(10)
Out[24]:
team_name 2011_2012_rank 2014_2015_rank rank_difference
16 Borussia Mönchengladbach 26 19 7
25 Cracovia 37 30 7
24 Club Brugge KV 36 29 7
23 Chievo Verona 35 28 7
22 Chelsea 34 27 7
21 Cesena 33 26 7
20 Celtic 32 25 7
17 CD Nacional 28 21 7
15 Borussia Dortmund 25 18 7
26 Dundee United 40 34 6

This simple table answers the question perfectly. No need to have a visualization.

Q2 What team attributes lead to the highest rank?

Our definition of leading to the highest rank is for the seasons 2011-2015 (4 seasons) as detailed earlier. The dependent variable is the rank and in this case the quartile rank. I am looking for those attributes (features) that appear to correlate to that higher rank. There are 20 features in this table that we are using. In order for us to understand which ones are important will require us to look at a lot of plots in an automated fashion. Not only are there many features, there are also 4 seasons for each feature AND 12 of the features are text and in order to evaluate them I need to convert those text fields to a numeric value.

In [25]:
ta_df.head()
Out[25]:
team_long_name id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1. FC Kaiserslautern 631 29 8350 2010-02-22 00:00:00 45 Balanced Little 55 Mixed ... 70 Lots Organised 70 High 70 Double 70 Wide Cover
1 1. FC Kaiserslautern 632 29 8350 2011-02-22 00:00:00 48 Balanced Little 42 Mixed ... 53 Normal Organised 38 Medium 48 Press 53 Normal Cover
2 1. FC Kaiserslautern 633 29 8350 2012-02-22 00:00:00 38 Balanced Little 54 Mixed ... 64 Normal Organised 37 Medium 51 Press 53 Normal Cover
3 1. FC Kaiserslautern 634 29 8350 2013-09-20 00:00:00 61 Balanced Little 54 Mixed ... 56 Normal Organised 48 Medium 50 Press 53 Normal Cover
4 1. FC Kaiserslautern 635 29 8350 2014-09-19 00:00:00 66 Balanced Normal 53 Mixed ... 51 Normal Organised 47 Medium 47 Press 49 Normal Cover

5 rows × 25 columns

Numeric Scatter Plots

I want to see if there are any obvious correlations between rank and features using scatter plots.

In [26]:
ta_df.head(3)
Out[26]:
team_long_name id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1. FC Kaiserslautern 631 29 8350 2010-02-22 00:00:00 45 Balanced Little 55 Mixed ... 70 Lots Organised 70 High 70 Double 70 Wide Cover
1 1. FC Kaiserslautern 632 29 8350 2011-02-22 00:00:00 48 Balanced Little 42 Mixed ... 53 Normal Organised 38 Medium 48 Press 53 Normal Cover
2 1. FC Kaiserslautern 633 29 8350 2012-02-22 00:00:00 38 Balanced Little 54 Mixed ... 64 Normal Organised 37 Medium 51 Press 53 Normal Cover

3 rows × 25 columns

In [27]:
def calc_season(date_value):
    # Calculate the season for the row
    date = date_value.split()
    ymd = date[0].replace('-', '')
    ymd = int(ymd)
    if ymd >= 20110701 and ymd <20120701:
        season = '2011/2012'
    elif ymd <20130701:
        season = '2012/2013'
    elif ymd <20140701:
        season = '2013/2014'
    elif ymd <20150701:
        season = '2014/2015'
    else:
        season = None
    return season
In [28]:
rsl = []
frsl = []

for index, feature in enumerate(features):
    f_df = ta_df[['date', 'team_long_name', feature]]
    fc_df = f_df.copy()
    fc_df['season'] = fc_df['date'].apply(calc_season) # Create from dates
    fc_df = fc_df[['season', 'date', 'team_long_name', feature]] # Reorder columns
    fc_df.dropna(how='any', inplace=True) # Not all seasons in data.
    fc_df = fc_df.groupby(['season', 'date', 'team_long_name']).max().reset_index()
    
    for year, season in enumerate(seasons):
        rank_slim = rank_year[year].filter(['season', 'home_team', 'rank'], axis=1)
        f_rank_season = rank_slim.merge(fc_df, left_on='home_team', right_on='team_long_name', how='inner')
        f_rank_season = f_rank_season[['season_x', 'team_long_name', feature, 'rank']]
        f_rank_season.rename({'season_x':'season'}, axis='columns', inplace=True)
        f_rank_season = f_rank_season.drop(f_rank_season[f_rank_season[feature] > 100].index) # Outlier bad data.
        f_rank_season = f_rank_season.groupby(['season', 'team_long_name']).max().reset_index()
        rsl.append(rank_slim)
        frsl.append(f_rank_season)
In [29]:
for year, season in enumerate(seasons):
        rank_slim = rank_year[year].filter(['season', 'home_team', 'rank'], axis=1)
        f_rank_season = rank_slim.merge(fc_df, left_on='home_team', right_on='team_long_name', how='inner')
        f_rank_season = f_rank_season[['season_x', 'team_long_name', feature, 'rank']]
        f_rank_season.rename({'season_x':'season'}, axis='columns', inplace=True)
        f_rank_season = f_rank_season.drop(f_rank_season[f_rank_season[feature] > 100].index) # Outlier bad data.
        f_rank_season = f_rank_season.groupby(['season', 'team_long_name']).max().reset_index()
        rsl.append(rank_slim)
        frsl.append(f_rank_season)
In [30]:
# Generate a column list for the plots. There are 32 potential plots and only 8 feature names.
features_list = []
for df in frsl:
    column = list(df)
    features_list.append(column[2])

Scatter plots for all 8 features X 4 seasons are available. If you run more than 20 they open in a separate window. It also very quickly crushes Jupyter Notebook. Avoid this please. To make it "easy" to manage this, this loop prints several plots at one time. Just edit AND RUN the cell below where the plot_list is.

In [31]:
plot_list = range(0, 4) # Will print buildUpPlaySpeed for each season. Edit this list
for value in plot_list:
    frsl[value].plot(x=features_list[value], y='rank', figsize=(8, 5), 
             title=frsl[value].season.iloc[0] + ' Scatter Plot', kind='scatter')
In [32]:
frsl[0].head()
Out[32]:
season team_long_name buildUpPlaySpeed rank
0 2011/2012 1. FC Kaiserslautern 66 1
1 2011/2012 1. FC Köln 60 2
2 2011/2012 1. FC Nürnberg 46 3
3 2011/2012 1. FSV Mainz 05 68 4
4 2011/2012 AC Ajaccio 65 5
  • There is often linear banding that appears at certain places on the x axis. This makes me wonder how valid the data is. That is how rigorous is the data collection and whether or not it is prone to human error or bias. The distribution of the data does not follow a normal pattern.
  • There appears to be no features associated with higher rank based on a review of the scatter plots.

Class (Text) Converted to Numeric Scatter Plots

I am now looking at the scatter plots for the text features. These are the features that have the word 'Class' in their feature name. They have text instead of numerics. I convert those to a simple integer value and then do a scatter plot on them.

In [33]:
# Deal with text classes.
features_class = ['buildUpPlaySpeedClass',
                 'buildUpPlayDribblingClass',
                 'buildUpPlayPassingClass',
                 'buildUpPlayPositioningClass',
                 'chanceCreationPassingClass',
                 'chanceCreationCrossingClass',
                 'chanceCreationShootingClass',
                 'chanceCreationPositioningClass',
                 'defencePressureClass',
                 'defenceAggressionClass',
                 'defenceTeamWidthClass',
                 'defenceDefenderLineClass']
In [34]:
# Needed to calculate dataframe column for classes 
def calc_numeric(feature, unique_list):
    value_list = list(range(1, len(unique_list)+1))
    for index, item in enumerate(unique_list):
        if item == feature:
            value = value_list[index]
            return value
        else:
            pass
In [35]:
# All seasons for class text columns. For loops that create numeric field and scatter plot

rsl_class = []
frsl_class = []
features_class_list = (features_class) * 4
plot_ready_class_list = []

for index, feature in enumerate(features_class):
    f_class_df = ta_df[['date', 'team_long_name', feature]]
    fc_class_df = f_class_df.copy()
    fc_class_df['season'] = fc_class_df['date'].apply(calc_season) # Create from dates
    fc_class_df = fc_class_df[['season', 'date', 'team_long_name', feature]] # Reorder columns
    fc_class_df.dropna(how='any', inplace=True) # Not all seasons in data.
    unique_list = (fc_class_df[feature].unique().tolist())
    fc_class_df[feature +'_num'] = fc_class_df[feature].apply(calc_numeric, args=[unique_list]) # Add num column
    
    for year, season in enumerate(seasons):
        rank_class_slim = rank_year[year].filter(['season', 'home_team', 'rank'], axis=1)
        f_rank_class_season = rank_class_slim.merge(fc_class_df, left_on='home_team', right_on='team_long_name', how='inner')
        f_rank_class_season = f_rank_class_season.filter(['season_x', 'team_long_name', feature, feature + '_num', 'rank'], axis=1)
        f_rank_class_season.rename({'season_x':'season'}, axis='columns', inplace=True)
        # Group the resuls by season and team_long_name and use max.
        # Would need to calculate a weighted average of the values as a percentage of season to use mean.
        # Not clear when the data is collected and when it is entered. max is fine to get trends.
        f_rank_class_season = f_rank_class_season.groupby(['season', 'team_long_name']).max().reset_index()
        rsl_class.append(rank_class_slim)
        frsl_class.append(f_rank_class_season)
In [36]:
# Generate a column list for the scatter plots.
features_class_num_list = []
for df in frsl_class:
    columns = list(df)
    features_class_num_list.append(columns[3])

Scatter plots for all 12 features X 4 seasons are available. If you run more than 20 this quickly crushes Jupyter Notebook. Avoid this please. To make it "easy" to manage this, this loop prints several plots at one time. Just edit AND RUN the cell below where the plot_list is.

In [37]:
print(features_class_num_list)
['buildUpPlaySpeedClass_num', 'buildUpPlaySpeedClass_num', 'buildUpPlaySpeedClass_num', 'buildUpPlaySpeedClass_num', 'buildUpPlayDribblingClass_num', 'buildUpPlayDribblingClass_num', 'buildUpPlayDribblingClass_num', 'buildUpPlayDribblingClass_num', 'buildUpPlayPassingClass_num', 'buildUpPlayPassingClass_num', 'buildUpPlayPassingClass_num', 'buildUpPlayPassingClass_num', 'buildUpPlayPositioningClass_num', 'buildUpPlayPositioningClass_num', 'buildUpPlayPositioningClass_num', 'buildUpPlayPositioningClass_num', 'chanceCreationPassingClass_num', 'chanceCreationPassingClass_num', 'chanceCreationPassingClass_num', 'chanceCreationPassingClass_num', 'chanceCreationCrossingClass_num', 'chanceCreationCrossingClass_num', 'chanceCreationCrossingClass_num', 'chanceCreationCrossingClass_num', 'chanceCreationShootingClass_num', 'chanceCreationShootingClass_num', 'chanceCreationShootingClass_num', 'chanceCreationShootingClass_num', 'chanceCreationPositioningClass_num', 'chanceCreationPositioningClass_num', 'chanceCreationPositioningClass_num', 'chanceCreationPositioningClass_num', 'defencePressureClass_num', 'defencePressureClass_num', 'defencePressureClass_num', 'defencePressureClass_num', 'defenceAggressionClass_num', 'defenceAggressionClass_num', 'defenceAggressionClass_num', 'defenceAggressionClass_num', 'defenceTeamWidthClass_num', 'defenceTeamWidthClass_num', 'defenceTeamWidthClass_num', 'defenceTeamWidthClass_num', 'defenceDefenderLineClass_num', 'defenceDefenderLineClass_num', 'defenceDefenderLineClass_num', 'defenceDefenderLineClass_num']
In [38]:
plot_list = range(8, 12) # Will print buildUpPlaySpeedClass_num for each season. Edit this list.
for value in plot_list:
    plot_title = frsl_class[value].season.iloc[0] + " Scatter Plot"
    sns.stripplot(x=features_class_num_list[value], y='rank', size=3, data=frsl_class[value], 
        jitter=True).set_title(plot_title)
    plt.show()

The scatter plots were not helpful. Especially the class features converted from text to numeric. Those class features banded vertically along the x axis. Very difficult to derive any information from that. It is not possible to tease information out of the class features by taking a mean over the 4 years. This is because the teams are not static. Players, coaches, etc. move in and out season to season. They are relatively stable within the season, but many changes are made between seasons. A mean over 4 years would be fundamentally misleading. Let's look at bar charts as a way of gaining information from this dataset.

Numeric Bar Charts

In [39]:
# Rank is taken every season and used to create a quantile bar chart for each feature. 
year_rank_quart_dict = {}
for value in range(4): 
    rank_list = frsl[value]['rank'].quantile([.25, .5, .75, 1]).tolist()
    dict_update = {frsl[value].season.iloc[0]: rank_list}
    year_rank_quart_dict.update(dict_update)
print(year_rank_quart_dict)   
{'2011/2012': [46.75, 93.5, 139.25, 185.0], '2012/2013': [47.25, 93.5, 139.75, 186.0], '2013/2014': [44.25, 87.5, 130.75, 174.0], '2014/2015': [47.5, 96.5, 142.75, 188.0]}
In [40]:
# Calculate the quartile rank for each season.
def calc_quartile(rank, season_key):
    quartile_list = year_rank_quart_dict[season_key]
    if rank <= quartile_list[0]:
        quartile = 1
    elif rank <= quartile_list[1]:
        quartile = 2
    elif rank <= quartile_list[2]:
        quartile = 3
    else:
        quartile = 4
    return quartile
In [41]:
# Create the dataframe for a simple bar chart. 
bar_list = []
for index, df in enumerate(frsl):
    season_key = df.season.iloc[0]
    df['quartile'] = df['rank'].apply(calc_quartile, args=[season_key])
    df = df[[features_list[index], 'quartile']]
    df = df.groupby('quartile').mean().reset_index()
    bar_list.append(df)
In [42]:
# Calculate title for bar chart titles
def calc_title(value, feature):
    yvl_1112 = range(0, 48, 4)
    yvl_1213 = range(1, 48, 4)
    yvl_1314 = range(2, 48, 4)
    if value in yvl_1112:
        return seasons[0]
    elif value in yvl_1213:
        return seasons[1]
    elif value in yvl_1314:
        return seasons[2]
    else:
        return seasons[3]
In [43]:
plot_list = range(0, 4) # Prints buildUpPlaySpeedClass for each season as a bar chart. Edit list.
for value in plot_list:
    bar_list[value].plot(x='quartile', label=features_list[value], figsize=(8, 5), 
         title=calc_title(value, features_list[value]) + ' Simple Bar Chart', kind='bar')

Again, these bar charts were not helpful. The differences are very small between the rank quartiles and each feature.

Class (Text) Converted to Numeric Bar Charts</a></li>

Continuing on with the simple bar charts for the class features.

In [44]:
# Create the dataframe for a simple bar chart. 
bar_class_list = []
for index, df in enumerate(frsl_class):
    season_key = df.season.iloc[0]
    df['quartile'] = df['rank'].apply(calc_quartile, args=[season_key])
    df = df.groupby('quartile').mean().reset_index()
    df = df[[features_class_num_list[index], 'quartile']]
    bar_class_list.append(df)
In [45]:
plot_list = range(8, 12) # Will print buildUpPlaySpeedClass_num for each season as a bar chart

for value in plot_list:
    bar_class_list[value].plot(x='quartile', label=features_list[value], figsize=(8, 5), 
        title=calc_title(value, features_class_num_list[value]) + ' Simple Bar Chart', kind='bar')

The review of these bar charts often show a very consistent pattern between rank 1 and rank 4 teams per feature. However, the differences are usually very small. To illuminate this better I am going to create a Numeric Share of Difference Bar Charts as below.

Numeric Share Of Difference Bar Charts

The above bar charts did not provide substantial additional information. There are only small differences between personnel and teams at the professional level of soccer. A Numeric Share of Difference Bar Chart should be more instructive.

  • Create a bar chart that calculates the difference between the mean of the quartiles. That is of the percentage difference how much is "owned" by each quartile. Plots will open in a separate window.
  • The purpose of this bar chart is to highlight the differences between the top quartile and the other quartiles. The patterns that I am looking for are:
    • A cascade from the top quartile down to the lowest quartile (or vica versa) and significant differences between them.
    • A pattern of the top quartile owning virtually all difference between all of the quartiles, especially as compared to quartile 2.
    • I would also like to see that pattern repeated (generally) over all 4 years. If I see these patterns, then there is possibly a correlation between this feature and being top quartile ranked.
  • These are the most instructive plots that I have been able to derive.

Step Down That Shows Interesting Pattern

dp.png

Step Up That Shows Interesting Pattern

bupp.png

Own The Difference That Shows Interesting Pattern

da.png

In [46]:
# Deal with features_list.
In [47]:
print(features_list)
['buildUpPlaySpeed', 'buildUpPlaySpeed', 'buildUpPlaySpeed', 'buildUpPlaySpeed', 'buildUpPlayPassing', 'buildUpPlayPassing', 'buildUpPlayPassing', 'buildUpPlayPassing', 'chanceCreationPassing', 'chanceCreationPassing', 'chanceCreationPassing', 'chanceCreationPassing', 'chanceCreationCrossing', 'chanceCreationCrossing', 'chanceCreationCrossing', 'chanceCreationCrossing', 'chanceCreationShooting', 'chanceCreationShooting', 'chanceCreationShooting', 'chanceCreationShooting', 'defencePressure', 'defencePressure', 'defencePressure', 'defencePressure', 'defenceAggression', 'defenceAggression', 'defenceAggression', 'defenceAggression', 'defenceTeamWidth', 'defenceTeamWidth', 'defenceTeamWidth', 'defenceTeamWidth', 'defenceTeamWidth', 'defenceTeamWidth', 'defenceTeamWidth', 'defenceTeamWidth']
In [48]:
fnsoddfl = [] # Feature Numeric Share Of Difference DataFrame List

for value in range(len(features_list)):
    heights = []
    df = bar_list[value].copy()
    raw_heights = df[features_list[value]].tolist()
    for index, height in enumerate(raw_heights):
        min_value = min(raw_heights)
        max_value = max(raw_heights)
        result = (height - min_value + .0001)/(max_value - min_value + .01) * 100
        heights.append(result)
    df[features_list[value]] = heights
    fnsoddfl.append(df)
In [49]:
# Change the plot list to see different feature's bar chart.

plot_list = range(0, 4) # Will print buildUpPlaySpeedClass for each season as a bar chart
for value in plot_list:
    title_suffix = ' Percentage Mean Difference With Rank Quartile'
    locations = [1, 2, 3, 4]
    labels = ('1', '2', '3', '4')
    heights = fnsoddfl[value][features_list[value]]
    plt.subplots(figsize=(8, 5))
    plt.bar(locations, heights, tick_label=labels)
    plt.title(calc_title(value, features_list[value]) + title_suffix)
    plt.xlabel('Rank Quartile')
    plt.ylabel(features_list[value]);

The features that showed the patterns that we were looking for are buildUpPlayPassing, chanceCreationShooting, defenceAggression, and defencePressure.

bupp.png ccs.png da.png dp.png

Class (Text) Converted to Numeric Share Of Difference Bar Charts</a></li>

In [50]:
fcnsoddfl = [] # Feature Class Num Share Of Difference DataFrame List

for value in range(len(features_class_num_list)):
    heights = []
    df = bar_class_list[value].copy()
    raw_heights = df[features_class_num_list[value]].tolist()
    for index, height in enumerate(raw_heights):
        min_value = min(raw_heights)
        max_value = max(raw_heights)
        result = (height - min_value + .0001)/(max_value - min_value + .01) * 100
        heights.append(result)
    df[features_class_num_list[value]] = heights
    fcnsoddfl.append(df)
In [51]:
plot_list = range(0, 4) # Will print buildUpPlaySpeedClass_num for each season as a bar chart

for value in plot_list:
    title_suffix = ' Percentage Mean Difference With Rank Quartile'
    locations = [1, 2, 3, 4]
    labels = ('1', '2', '3', '4')
    heights = fcnsoddfl[value][features_class_num_list[value]]
    plt.subplots(figsize=(8, 5))
    plt.bar(locations, heights, tick_label=labels)
    plt.title(calc_title(value, features_class_num_list[value]) + title_suffix)
    plt.xlabel('Rank Quartile')
    plt.ylabel(features_class_num_list[value]);

The features that showed the patterns that we were looking for consistently across all 4 of the seasons were buildUpPlayPassingClass_num and chanceCreationPassingClass_num.

buppc_num.png ccpc_num.png

Correlation Between Significant Features Scatter Plots</a></li>

Now I need to understand the correlations, if any, between the most significant variables. The variables that I have chosen based on the preceding percentage of difference bar charts are in the Significant Features List (sfl) below.

In [52]:
sfl = ['buildUpPlayPassing', 'defenceAggression', 'defencePressure', 'chanceCreationShooting', 
       'buildUpPlayPositioningClass_num', 'chanceCreationPositioningClass_num']
all_features_by_year = (features_list + features_class_num_list)
fid = collections.OrderedDict()
fid = dict(zip(all_features_by_year, range(1, len(all_features_by_year)+1)))
df_years_list = []
sfl_dfs = frsl + frsl_class
df_agg_list = []
In [53]:
# Take all 4 seasons of data, trim it, and put it into one list per signficant feature.

for value in sfl:
    fid_value = value
    for value in range(fid[fid_value] - 4, fid[fid_value]):
        df = sfl_dfs[value][['team_long_name', fid_value, 'quartile']]
        df = df[df['quartile'] == 1]
        df_years_list.append(df)

# Append the 4 files together. 
for i in range(len(sfl)):
    fyl_index = range(0, len(sfl*4), 4)
    offset = fyl_index[i]
    df = df_years_list[offset].append(df_years_list[offset + 1], ignore_index=True)
    df1 = df.append(df_years_list[offset + 2], ignore_index=True)
    df2 = df1.append(df_years_list[offset + 3], ignore_index=True)
    df2 = df2.drop(['quartile'], axis=1)
    df_agg_list.append(df2)
In [54]:
col_list = range(len(sfl))
row_list = range(len(sfl))
In [55]:
# Merged pair list
mpl = []
pair = (0, 0)
for col in col_list:
    for row in row_list:
        if col == row:
            pass
        elif (row, col) in mpl or (col, row) in mpl:
            pass
        else:
            pair = (col, row)
            mpl.append(pair)
In [56]:
# Merge the dataframes into pairs for scatter plot purposes to test correlation.
# NB IF you run this once and you want to run it again, you need to go back 4 cells and start 
# there otherwise it will fail.
merged_features = []
for item in mpl:
    left_side, right_side = item
    df = df_agg_list[left_side].merge(df_agg_list[right_side], left_index=True, right_index=True, how='left')
    df.drop(['team_long_name_y'], axis=1, inplace=True)
    df.rename(columns={'team_long_name_x':'team_long_name'}, inplace=True)
    merged_features.append(df)
    plot_title = 'Top Rank Quartile Correlation For 2011-2015 Seasons'
    df.plot(x=sfl[left_side], y=sfl[right_side], 
            figsize=(8, 5), title=plot_title, kind='scatter')