FIFA 20: explain default vs tuned model with dalex

imports

In [1]:
import dalex as dx 

import numpy as np
import pandas as pd

from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings('ignore')
In [2]:
dx.__version__
Out[2]:
'1.4.0'

load data

Load fifa, the preprocessed players_20 dataset. It contains the 5000 best players by overall rating and 43 columns. These are:

  • short_name (index)
  • nationality of the player (not used in modeling)
  • overall, potential, value_eur, wage_eur (4 potential target variables)
  • age, height, weight, attacking skills, defending skills, goalkeeping skills (37 variables)

It is advised to keep only one of these target variables for modeling.

In [3]:
data = dx.datasets.load_fifa()
In [4]:
data.head(10)
Out[4]:
nationality overall potential wage_eur value_eur age height_cm weight_kg attacking_crossing attacking_finishing ... mentality_penalties mentality_composure defending_marking defending_standing_tackle defending_sliding_tackle goalkeeping_diving goalkeeping_handling goalkeeping_kicking goalkeeping_positioning goalkeeping_reflexes
short_name
L. Messi Argentina 94 94 565000 95500000 32 170 72 88 95 ... 75 96 33 37 26 6 11 15 14 8
Cristiano Ronaldo Portugal 93 93 405000 58500000 34 187 83 84 94 ... 85 95 28 32 24 7 11 15 14 11
Neymar Jr Brazil 92 92 290000 105500000 27 175 68 87 87 ... 90 94 27 26 29 9 9 15 15 11
J. Oblak Slovenia 91 93 125000 77500000 26 188 87 13 11 ... 11 68 27 12 18 87 92 78 90 89
E. Hazard Belgium 91 91 470000 90000000 28 175 74 81 84 ... 88 91 34 27 22 11 12 6 8 8
K. De Bruyne Belgium 91 91 370000 90000000 28 181 70 93 82 ... 79 91 68 58 51 15 13 5 10 13
M. ter Stegen Germany 90 93 250000 67500000 27 187 85 18 14 ... 25 70 25 13 10 88 85 88 88 90
V. van Dijk Netherlands 90 91 200000 78000000 27 193 92 53 52 ... 62 89 91 92 85 13 10 13 11 11
L. Modric Croatia 90 90 340000 45000000 33 172 66 86 72 ... 82 92 68 76 71 13 9 7 14 9
M. Salah Egypt 90 90 240000 80500000 27 175 71 79 90 ... 77 91 38 43 41 14 14 9 11 14

10 rows × 42 columns
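
As a quick sanity check of the column breakdown described above, one can count the feature columns left after setting aside nationality and the four potential targets (a minimal sketch):

# 42 columns = nationality + 4 targets + 37 modeling variables (short_name is the index)
targets = ['overall', 'potential', 'value_eur', 'wage_eur']
features = data.columns.drop(['nationality'] + targets)
print(len(features))  # 37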

Divide the data into explanatory variables X and a target variable y. Here, we will be predicting the players' value (value_eur).

In [5]:
X = data.drop(["nationality", "overall", "potential", "value_eur", "wage_eur"], axis=1)
y = data['value_eur']

The target variable is right-skewed, so we transform it with the logarithm for a better fit.

In [6]:
ylog = np.log(y)

import matplotlib.pyplot as plt
plt.hist(ylog, bins='auto')
plt.title("ln(value_eur)")
plt.show()

Split the data into train and test.

In [7]:
X_train, X_test, ylog_train, ylog_test, y_train, y_test = \
    train_test_split(X, ylog, y, test_size=0.25, random_state=4)

create a default boosting model

In [8]:
gbm_default = LGBMRegressor()

gbm_default.fit(X_train, ylog_train, verbose=False)
Out[8]:
LGBMRegressor()

create a tuned model

In [9]:
gbm_default._estimator_type  # sklearn attribute indicating regressor vs. classifier
Out[9]:
'regressor'
In [10]:
#:# hp tuning
estimator = LGBMRegressor(n_jobs=-1)
param_test = {
    'n_estimators': list(range(201, 1202, 50)),
    'num_leaves': list(range(6, 42, 5)),
    'min_child_weight': [1e-3, 1e-2, 1e-1, 15e-2],
    'learning_rate': [1e-3, 1e-2, 1e-1, 15e-2]
}

rs = RandomizedSearchCV(
    estimator=estimator, 
    param_distributions=param_test, 
    n_iter=100,
    cv=4,
    random_state=1
)

#:# the search takes a while to run, so the fit is commented out; the best found parameters are hard-coded below
# rs.fit(X, ylog)
# print('Best score reached: {} with params: {} '.format(rs.best_score_, rs.best_params_))
In [11]:
#:# best parameters after 100 iterations
best_params = {'num_leaves': 6,
               'n_estimators': 951,
               'min_child_weight': 0.1,
               'learning_rate': 0.15} 
In [12]:
gbm_tuned = LGBMRegressor(**best_params)
gbm_tuned.fit(X_train, ylog_train)
Out[12]:
LGBMRegressor(learning_rate=0.15, min_child_weight=0.1, n_estimators=951,
              num_leaves=6)

create explainers for the models

We aim to see the real values of the target variable in the explanations (EUR, not its logarithm). Therefore, we need a custom predict_function that back-transforms the model's predictions with exp.

In [13]:
def predict_function(model, data):
    # back-transform predictions from the log scale to EUR
    return np.exp(model.predict(data))
In [14]:
exp_default = dx.Explainer(gbm_default, X_test, y_test,
                           predict_function=predict_function, label='default')
exp_tuned = dx.Explainer(gbm_tuned, X_test, y_test,
                         predict_function=predict_function, label='tuned')
Preparation of a new explainer is initiated

  -> data              : 1250 rows 37 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1250 values
  -> model_class       : lightgbm.sklearn.LGBMRegressor (default)
  -> label             : default
  -> predict function  : <function predict_function at 0x000001B2AEAA13A0> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 3.57e+05, mean = 7.12e+06, max = 8.12e+07
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1e+07, mean = 2.12e+05, max = 2.43e+07
  -> model_info        : package lightgbm

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 1250 rows 37 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1250 values
  -> model_class       : lightgbm.sklearn.LGBMRegressor (default)
  -> label             : tuned
  -> predict function  : <function predict_function at 0x000001B2AEAA13A0> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 3.56e+05, mean = 7.12e+06, max = 9.51e+07
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.49e+07, mean = 2.11e+05, max = 2.41e+07
  -> model_info        : package lightgbm

A new explainer has been created!

All of the dalex functionalities are accessible from the Explainer object through its methods.

Model-level and predict-level methods each return a new object that contains a result attribute (a pandas.DataFrame) and a plot method.
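
The cells below use a model-level method, model_performance. Predict-level methods follow the same pattern; as a minimal sketch (assuming the explainers created above), Break Down attributions for the first test observation:

# predict-level example: Break Down attributions for one player
bd = exp_tuned.predict_parts(X_test.iloc[[0]], type='break_down')
bd.result  # a pandas.DataFrame with variable attributions
bd.plot()  # the corresponding plot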

In [15]:
mp_default = exp_default.model_performance("regression")
mp_default.result
Out[15]:
                  mse          rmse        r2           mae            mad
default  5.727888e+12  2.393301e+06  0.923228  1.209461e+06  651771.861674
In [16]:
mp_tuned = exp_tuned.model_performance("regression")
mp_tuned.result
Out[16]:
                mse          rmse        r2           mae          mad
tuned  4.117085e+12  2.029060e+06  0.944818  1.092173e+06  595137.6644
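
Because the results are plain pandas.DataFrames, they can also be concatenated for a side-by-side comparison (a quick sketch):

# compare both models in one table
pd.concat([mp_default.result, mp_tuned.result])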
In [17]:
mp_default.plot(mp_tuned)
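
For regression explainers, this overlays the reverse cumulative distribution of absolute residuals of both models (plot not reproduced here); the lower the curve, the smaller the errors.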