Activity 1: FIFA Linear Regression

Workshop Resources

Let’s use machine learning to estimate players' overall ratings in the FIFA video game.

Machine learning is the study of *algorithms* and *models* that enable computers to recognize things, make decisions, and even predict results without being explicitly programmed for each task. For example, when you talk to a phone assistant such as Siri or Cortana, machine learning helps translate your voice into text and then work out what you asked for. Isn't that amazing?

Today we are going to show you, step by step, how to *teach* a computer to estimate the overall ratings of soccer players based on their attributes.

Let's get to it!

A little background

Assume that EA Sports (the developer of FIFA 2019) uses a formula to calculate the "Overall" rating of each soccer player. With this formula, we could easily calculate the overall rating of any player, even one who is not in the game. The problem is that we don't know exactly what the formula looks like.
We do know the *input*, which consists of player attributes, and the *output*, which is the Overall rating. So we can use an approach called "regression" to estimate the formula from these input/output pairs.

Today, we are going to use a simple model called linear regression. Let's assume the formula that calculates the overall rating of a soccer player, $y = f(x)$, is \[ f(x) = ax + b \] Linear regression aims to figure out $a$ and $b$. The formula $f(x)$ is called a "model" in machine learning, and the process of estimating its parameters is called "training" the model. Once we have trained the model, we can use it to predict the target $y$ for new data.
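To make the idea of estimating $a$ and $b$ concrete, here is a minimal sketch of the standard least-squares solution for a single variable, written in plain Python. The function name fit_line and the toy numbers are made up for illustration; they are not part of the FIFA dataset or of the workshop code.

```python
def fit_line(xs, ys):
    """Return (a, b) that minimize the squared error of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope a = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    # the fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - a * mean_x
    return a, b


# Toy data generated from y = 2x + 1 (made-up numbers, not FIFA data)
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
print("a =", a, "b =", b)
```

Because the toy points lie exactly on a line, the fit recovers the slope and intercept that generated them. Real data is noisy, so the fitted line is only the best compromise.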

Back to our story: if we only have one variable $x$, estimating $f(x)$ is easy. Anyone can solve it with a pen and a piece of paper. However, when $x$ is a long list of player attributes such as speed, power, passing, and tackling, it becomes complicated. The formula has to be rewritten as \[ f(x_1, x_2, \dots, x_n) = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b \] Then we have to feed the model a lot of high-quality data to bring it closer to the "real" formula. Let's get started!
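With several attributes, the same least-squares idea still applies, but it is easier to let a library do the algebra. The sketch below uses NumPy's least-squares solver on synthetic data. The attribute values, the "true" coefficients, and all variable names are assumptions made up for illustration; scikit-learn's LinearRegression, imported later in this tutorial, solves the same kind of problem.

```python
import numpy as np

# Synthetic "players": 50 rows, 3 made-up attributes with values between 40 and 99
rng = np.random.default_rng(0)
X = rng.uniform(40, 99, size=(50, 3))

# Hypothetical "real" formula: y = 0.5*x1 + 0.3*x2 + 0.2*x3 + 2
true_a = np.array([0.5, 0.3, 0.2])
true_b = 2.0
y = X @ true_a + true_b

# Append a column of ones so that the intercept b is estimated as well
X_with_ones = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X_with_ones, y, rcond=None)

a_hat, b_hat = coef[:-1], coef[-1]
print("estimated a:", a_hat)
print("estimated b:", b_hat)
```

Since the synthetic targets contain no noise, the estimated coefficients match the made-up formula up to floating-point error.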

Step 1: get dataset

FIFA 2019 is a soccer video game. Every player in the game has an overall rating as well as many attributes such as crossing, finishing, etc.

We are heading to a website called Kaggle to get our dataset.

FIFA19 dataset

*Note: you may need to sign up to get the download link*.

On this page, you can find a lot of information about this dataset. Take some time to browse it and familiarize yourself with the dataset.

After you download it, extract the zip file to a folder, let’s say C:\fifa_dataset\.

Step 2: start the project

Open Jupyter Notebook and create a new notebook (New > Python 3).

At the beginning of the file, let’s import some necessary packages.

# Importing necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Step 3: load dataset

Change mypath to the folder where you extracted the dataset file (e.g., C:\fifa_dataset\). To verify that it loaded successfully, we call describe() to print summary statistics.

# load datasets
mypath = "C:/Users/ruilliu/Documents/nuevo_lr_fifa/" # change it to your own path
fifa_data = pd.read_csv(mypath+"data.csv")

# print summary statistics to verify that the data loaded
fifa_data.describe()

Step 4: Pre-process data

By now, we have imported our dataset. In real life, each soccer player has a specific position, and different positions require strength in different attributes. So let’s narrow down the scope to strikers.

First, let’s list all the positions. This statement looks a little long, but it does the job: fifa_data['Position'] selects the Position column of fifa_data, dropna() drops the blank cells, and unique() removes the duplicates.

# list the distinct positions in the dataset
print(fifa_data['Position'].dropna().unique())
['RF' 'ST' 'LW' 'GK' 'RCM' 'LF' 'RS' 'RCB' 'LCM' 'CB' 'LDM' 'CAM' 'CDM'
 'LS' 'LCB' 'RM' 'LAM' 'LM' 'LB' 'RDM' 'RW' 'CM' 'RB' 'RAM' 'CF' 'RWB'
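To see what each part of that statement does, here is a small sketch on a toy table. The DataFrame toy and its values are made up for illustration; only the column name Position mirrors the real dataset.

```python
import pandas as pd

# A made-up table with one missing value and some repeated positions
toy = pd.DataFrame({"Position": ["ST", "GK", None, "ST", "CB", "GK"]})

# dropna() drops the missing entry, unique() keeps one copy of each value
positions = toy["Position"].dropna().unique()

print(len(toy), "rows,", len(positions), "distinct positions")
print(list(positions))
```

unique() keeps values in the order in which they first appear, which is why the real output above is not sorted alphabetically.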

Now we can filter the data by position “ST”. You’re encouraged to select other positions to see how the results differ.

# get players by position
fifa_data_by_pos = fifa_data[fifa_data['Position']=='ST']

Let’s plot a histogram for overall ratings of all strikers.

# "Overall" is the column that holds the overall rating
plt.hist(x=fifa_data_by_pos["Overall"], bins=10, alpha=0.75, rwidth=0.85)
(array([ 40., 186., 363., 463., 601., 341., 113.,  34.,   9.,   2.]),
 array([47. , 51.7, 56.4, 61.1, 65.8, 70.5, 75.2, 79.9, 84.6, 89.3, 94. ]),
 <a list of 10 Patch objects>)


Next, we want to split the data into two sets: one is used to train the model, and the other is used to check how good the trained model is.

You might think we should use as much data as possible for training because more data makes the model better. However, a model can fit its training data very well and still predict poorly on data it has never seen. This is called “overfitting”, and we can only detect it if we hold back some data that the model never sees during training.

Now, we leave 25% of the data for testing.

# split data into train_data and test_data randomly
# you're welcome to change the ratio of test_size to see what will happen
train_data, test_data = train_test_split(fifa_data_by_pos,test_size=0.25)

# print the number of players in train_data and test_data
# len() gives you the number of players in numerical format
# str() converts numerical value into string
print("The # of training data is " + str(len(train_data)))
print("The # of testing data is " + str(len(test_data)))
The # of training data is 1614
The # of testing data is 538
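Because the split is random, the exact rows in each set change every time the cell runs. The sketch below, on a made-up list of 100 row numbers, shows how the sizes follow from test_size and how train_test_split's random_state parameter makes the split repeatable. The value 42 is an arbitrary choice for illustration.

```python
from sklearn.model_selection import train_test_split

# 100 made-up "players", represented only by their row numbers
rows = list(range(100))

# test_size=0.25 keeps 25% of the rows for testing
train_rows, test_rows = train_test_split(rows, test_size=0.25, random_state=42)

# Using the same random_state again reproduces exactly the same split
train_again, test_again = train_test_split(rows, test_size=0.25, random_state=42)

print(len(train_rows), "training rows,", len(test_rows), "testing rows")
print("same split both times:", test_rows == test_again)
```

A fixed random_state is handy when you want to compare two models on exactly the same training and testing rows.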

Step 5: feature selection

Our next step is selecting proper features. Feature selection is a term in machine learning to describe the method and process of choosing relevant features for the model. A feature is one $x$ in the formula. In our story, it is an attribute of a soccer player.

Since we are using a linear regression model, how strongly an attribute correlates with the target (“Overall”) is our criterion for choosing features.

We use the built-in pandas function corr() to compute the pairwise correlation of columns. There are three methods to choose from: pearson, kendall, and spearman.

In this tutorial we use pearson.

# select target
target = "Overall"

# To find the correlation among the columns using pearson method
# (only numeric columns can be correlated, so we select them first)
feature_corr = train_data.select_dtypes(include='number').corr(method='pearson')[target]

# sort the features
feature_corr = feature_corr.sort_values(ascending = False)

# show the top 20 features
# note that we start from 1, not 0, because Overall is always at the top of the list
print(feature_corr[1:21])
Positioning        0.904367
Special            0.903856
Finishing          0.899783
BallControl        0.896988
ShotPower          0.877842
Reactions          0.861441
Volleys            0.834433
Composure          0.827529
ShortPassing       0.813074
Dribbling          0.802565
LongShots          0.794059
HeadingAccuracy    0.711129
Vision             0.671054
Skill Moves        0.649300
Curve              0.641426
Crossing           0.603249
Potential          0.593139
Penalties          0.583906
LongPassing        0.575092
FKAccuracy         0.569704
Name: Overall, dtype: float64
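For intuition about what these numbers mean, here is a minimal sketch of the Pearson correlation coefficient computed by hand. The function name pearson and the three short lists are made up for illustration; they are not values from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up ratings for five imaginary strikers
finishing = [50, 60, 70, 80, 90]
overall = [52, 61, 69, 82, 88]
shirt_number = [7, 23, 9, 11, 10]

r_finishing = pearson(finishing, overall)
r_shirt = pearson(shirt_number, overall)
print("finishing vs overall:", round(r_finishing, 3))
print("shirt number vs overall:", round(r_shirt, 3))
```

A value close to 1 means the two columns rise together almost linearly, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.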

Now, we can copy and paste some of the top features from this list. (Note: do not copy the surrounding spaces.)

# select some features
features = ["Positioning", "Finishing", "Special", "BallControl", 
            "ShotPower", "Reactions", "Volleys", "Composure", "ShortPassing"]

Alternatively, we can extract the feature names from the index. Note that we start from 1 because we do not want to include Overall, which is always at the top of the list.

# extract feature names from the series
features = feature_corr[1:21].index.tolist()

# show the features
print(features)
['Positioning', 'Special', 'Finishing', 'BallControl', 'ShotPower', 'Reactions', 'Volleys', 'Composure', 'ShortPassing', 'Dribbling', 'LongShots', 'HeadingAccuracy', 'Vision', 'Skill Moves', 'Curve', 'Crossing', 'Potential', 'Penalties', 'LongPassing', 'FKAccuracy']

Step 6: train the model

Now we are ready to train the model. We use LinearRegression().fit() to train it. The model object has a score() function that returns the score of the model, which is the coefficient of determination $R^2$ of the prediction. For now, you only need to know that higher is better.

# prepare training data
x_train = train_data[features]
y_train = train_data[target]

# Applying Linear regression
# fit() is the method to train the model
model = LinearRegression().fit(x_train,y_train)

# Model's score
print("Score: " + str(model.score(x_train,y_train)))
Score: 0.9875123836174596
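If you are curious what the score measures, the sketch below computes $R^2$ by hand as one minus the ratio of the squared prediction errors to the squared deviations from the mean. The function name r_squared and the four ratings are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted ratings for four imaginary players
actual = [80, 70, 60, 90]
predicted = [78, 71, 63, 88]

score = r_squared(actual, predicted)
print("R^2:", score)

# Two reference points: perfect predictions, and always guessing the average
perfect = r_squared(actual, actual)
mean_only = r_squared(actual, [75, 75, 75, 75])
print("perfect:", perfect, "always the mean:", mean_only)
```

Perfect predictions give a score of 1, and a model that always predicts the average rating gives 0, so a score close to 1 means the model explains most of the variation in the ratings. Computing the same score on data the model has not seen gives a fairer picture than the training score.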

Step 7: try the model on testing data

Now we use the trained model to estimate the ratings of the players in test_data. Similar to what we did with train_data, we create x_test and y_test.

model.predict() will generate a list of predicted results.

# we would like to sort test data on target value ("Overall")
test_data = test_data.sort_values([target], ascending=False)

x_test = test_data[features]
y_test = test_data[target]

y_pred = model.predict(x_test)
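The same predict() call also works for a single player who is not in the dataset at all. The sketch below is self-contained and uses a tiny made-up table with a hypothetical rating rule, so the names toy_train, toy_model, and new_player and all numbers are assumptions for illustration rather than real FIFA values.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

toy_features = ["Finishing", "Positioning"]

# Made-up training table; Overall follows a hypothetical rule 0.6*Finishing + 0.4*Positioning
toy_train = pd.DataFrame({
    "Finishing":   [60, 70, 80, 90, 65, 85],
    "Positioning": [62, 68, 82, 88, 70, 80],
})
toy_train["Overall"] = 0.6 * toy_train["Finishing"] + 0.4 * toy_train["Positioning"]

toy_model = LinearRegression().fit(toy_train[toy_features], toy_train["Overall"])

# A new imaginary player; the columns must be in the same order as during training
new_player = pd.DataFrame([{"Finishing": 75, "Positioning": 77}])[toy_features]
predicted = toy_model.predict(new_player)[0]
print("Predicted Overall:", round(predicted, 2))
```

predict() always returns one value per input row, so a one-row table gives a one-element result.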

Let’s compare with the actual overall ratings

# add a new column of predicted overall to test_data
test_data['Predicted Overall'] = y_pred.copy()

# add a new column of prediction difference ratio to test_data
difference = (y_pred - y_test) / y_test * 100
test_data['Difference (%)'] = difference

# print the results
test_data[["Name", "Nationality", "Club", "Overall", "Predicted Overall", "Difference (%)"]]

| | Name | Nationality | Club | Overall | Predicted Overall | Difference (%) |
|---|---|---|---|---|---|---|
| 1 | Cristiano Ronaldo | Portugal | Juventus | 94 | 91.973701 | -2.155638 |
| 10 | R. Lewandowski | Poland | FC Bayern München | 90 | 88.135513 | -2.071652 |
| 23 | S. Agüero | Argentina | Manchester City | 89 | 87.807637 | -1.339733 |
| 48 | C. Immobile | Italy | Lazio | 87 | 85.933234 | -1.226168 |
| 159 | Louri Beretta | Brazil | Atlético Mineiro | 83 | 81.583941 | -1.706096 |
| 193 | Rodrigo | Spain | Valencia CF | 83 | 81.784946 | -1.463921 |
| 179 | S. Gnabry | Germany | FC Bayern München | 83 | 79.978980 | -3.639783 |
| 315 | David Villa | Spain | New York City FC | 82 | 81.259066 | -0.903578 |
| 362 | Paco Alcácer | Spain | Borussia Dortmund | 81 | 81.836532 | 1.032756 |
| 518 | Alexandre Pato | Brazil | Tianjin Quanjian FC | 80 | 78.322831 | -2.096461 |
| 499 | L. de Jong | Netherlands | PSV | 80 | 79.993062 | -0.008672 |
| 523 | K. Gameiro | France | Valencia CF | 80 | 79.130702 | -1.086622 |
| 721 | B. Yılmaz | Turkey | Trabzonspor | 79 | 78.092396 | -1.148866 |
| 693 | S. Jovetić | Montenegro | AS Monaco | 79 | 79.353044 | 0.446891 |
| 591 | L. Alario | Argentina | Bayer 04 Leverkusen | 79 | 79.066446 | 0.084109 |
| 569 | André Silva | Portugal | Sevilla FC | 79 | 79.925229 | 1.171175 |
| 588 | M. Philipp | Germany | Borussia Dortmund | 79 | 78.962674 | -0.047248 |
| 561 | L. Martínez | Argentina | Inter | 79 | 79.411940 | 0.521443 |
| 874 | A. Dzyuba | Russia | NaN | 78 | 76.855093 | -1.467829 |
| 825 | S. García | Uruguay | Godoy Cruz | 78 | 77.375588 | -0.800528 |
| 909 | V. Germain | France | Olympique de Marseille | 77 | 77.509005 | 0.661045 |
| 1095 | N. Jørgensen | Denmark | Feyenoord | 77 | 76.745918 | -0.329976 |
| 992 | J. Sand | Argentina | Deportivo Cali | 77 | 78.886169 | 2.449570 |
| 1137 | Rubén Castro | Spain | UD Las Palmas | 77 | 77.797984 | 1.036343 |
| 895 | M. Harnik | Austria | SV Werder Bremen | 77 | 76.926679 | -0.095222 |
| 1413 | Alan Carvalho | Brazil | Guangzhou Evergrande Taobao FC | 76 | 75.922866 | -0.101492 |
| 1327 | K. Dolberg | Denmark | Ajax | 76 | 76.060831 | 0.080041 |
| 1496 | F. Montero | Colombia | Sporting CP | 76 | 77.017187 | 1.338404 |
| 1240 | I. Popov | Bulgaria | Spartak Moscow | 76 | 75.734350 | -0.349540 |
| 1357 | I. Slimani | Algeria | Fenerbahçe SK | 76 | 76.494507 | 0.650667 |
| ... | ... | ... | ... | ... | ... | ... |
| 17484 | J. Lankester | England | Ipswich Town | 54 | 56.121884 | 3.929415 |
| 17469 | J. Gallagher | Republic of Ireland | Atlanta United | 54 | 54.692444 | 1.282304 |
| 17501 | M. Saavedra | Chile | Audax Italiano | 54 | 54.137463 | 0.254561 |
| 17361 | E. McKeown | England | Colchester United | 54 | 52.796085 | -2.229473 |
| 17399 | Mao Haoyu | China PR | Tianjin TEDA FC | 54 | 53.964477 | -0.065783 |
| 17313 | M. Howard | England | Preston North End | 54 | 53.339370 | -1.223389 |
| 17355 | V. Barbero | Argentina | Belgrano de Córdoba | 54 | 54.011344 | 0.021008 |
| 17422 | Y. Ogaki | Japan | Nagoya Grampus | 54 | 54.041024 | 0.075970 |
| 17447 | Xie Weijun | China PR | Tianjin TEDA FC | 54 | 53.452376 | -1.014118 |
| 17367 | T. Lauritsen | Norway | Odds BK | 54 | 54.944641 | 1.749336 |
| 17482 | F. Al Birekan | Saudi Arabia | Al Nassr | 54 | 52.727175 | -2.357084 |
| 17609 | S. Jamieson | Scotland | St. Mirren | 53 | 53.509650 | 0.961604 |
| 17716 | M. Knox | Scotland | Livingston FC | 53 | 52.826053 | -0.328201 |
| 17578 | Lei Wenjie | China PR | Shanghai SIPG FC | 53 | 52.770581 | -0.432867 |
| 17665 | J. Smylie | Australia | Central Coast Mariners | 53 | 52.469974 | -1.000049 |
| 17611 | Felipe Ferreyra | Brazil | Curicó Unido | 53 | 52.861431 | -0.261451 |
| 17765 | A. Georgiou | Cyprus | Stevenage | 52 | 52.167786 | 0.322665 |
| 17757 | L. Smyth | Northern Ireland | Stevenage | 52 | 51.999942 | -0.000111 |
| 17923 | A. Reghba | Republic of Ireland | Bohemian FC | 51 | 51.075501 | 0.148041 |
| 17956 | C. Murphy | Republic of Ireland | Cork City | 51 | 51.731985 | 1.435265 |
| 17971 | M. Najjar | Australia | Melbourne City FC | 51 | 51.035541 | 0.069688 |
| 18013 | W. Møller | Denmark | Esbjerg fB | 51 | 50.796960 | -0.398118 |
| 18062 | Gao Dalun | China PR | Jiangsu Suning FC | 50 | 49.677371 | -0.645259 |
| 18094 | M. Al Dhafeeri | Saudi Arabia | Al Batin | 50 | 51.553964 | 3.107928 |
| 18063 | R. Hackett-Fairchild | England | Charlton Athletic | 50 | 50.140762 | 0.281524 |
| 18028 | D. Asonganyi | England | Milton Keynes Dons | 50 | 50.349896 | 0.699792 |
| 18140 | K. Hawley | England | Morecambe | 49 | 49.787332 | 1.606799 |
| 18166 | N. Ayéva | Sweden | Örebro SK | 48 | 48.802935 | 1.672781 |
| 18177 | R. Roache | Republic of Ireland | Blackpool | 48 | 49.226015 | 2.554197 |
| 18200 | J. Young | Scotland | Swindon Town | 47 | 48.019387 | 2.168908 |

538 rows × 6 columns
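A long table is hard to judge by eye, so it can help to boil the differences down to a single number. The sketch below does this in plain Python for the first four rows of the results above (actual and predicted ratings copied from the table); the variable names are made up for illustration, and the same idea applies to the full y_test and y_pred.

```python
# Actual and predicted ratings of the first four rows in the results table
actual = [94, 90, 89, 87]
predicted = [91.973701, 88.135513, 87.807637, 85.933234]

# Percentage difference per player, same formula as the "Difference (%)" column
pct_diff = [(p - a) / a * 100 for a, p in zip(actual, predicted)]

# Mean absolute error: on average, how many rating points the prediction is off
mean_abs_error = sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

for a, d in zip(actual, pct_diff):
    print("actual", a, "difference", round(d, 2), "%")
print("mean absolute error:", round(mean_abs_error, 2), "rating points")
```

Taking absolute values matters here: without them, predictions that are too high and too low would cancel each other out and make the model look better than it is.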

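Isn't that amazing? In the rows shown above, every prediction is within 4% of the actual rating, so you can be fairly confident using this model to estimate the overall rating of a striker from his or her attributes. Keep in mind that the model was trained only on strikers, so it may not work as well for other positions.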

Now let’s do some plotting to visualize it.

# Plot outputs
plt.scatter(range(0,y_test.shape[0]), y_test,  color='blue', label="Actual")
plt.plot(range(0,y_test.shape[0]), y_pred, color='red', label="Predicted")

# add ticks, labels, legend
plt.xlabel("Players (Sorted by Actual Overall ratings)")
plt.ylabel("Overall ratings")
plt.legend(loc='upper right')



Well done! You did it!

Next, you can play with this dataset a little bit.

In today’s class, you learned how to train a linear regression model to estimate the overall ratings of soccer players. We hope you enjoyed it and feel a little inspired.

From here, you can explore the Kaggle website and find another dataset to play with. Apply linear regression to predict or estimate the results. You’ll be amazed by what can be done.