FIFA 20: explain default vs tuned model with dalex

imports

In [1]:
import dalex as dx 

import numpy as np
import pandas as pd

from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings('ignore')
In [2]:
dx.__version__
Out[2]:
'0.2.1'

load data

Load fifa, the preprocessed players_20 dataset. It contains the 5000 highest-rated players (by overall) and 43 columns. These are:

  • short_name (index)
  • nationality of the player (not used in modeling)
  • overall, potential, value_eur, wage_eur (4 potential target variables)
  • age, height, weight, attacking skills, defending skills, goalkeeping skills (37 variables)

It is advisable to keep only one target variable for modeling.

In [3]:
data = dx.datasets.load_fifa()
In [4]:
data.head(10)
Out[4]:
nationality overall potential wage_eur value_eur age height_cm weight_kg attacking_crossing attacking_finishing ... mentality_penalties mentality_composure defending_marking defending_standing_tackle defending_sliding_tackle goalkeeping_diving goalkeeping_handling goalkeeping_kicking goalkeeping_positioning goalkeeping_reflexes
short_name
L. Messi Argentina 94 94 565000 95500000 32 170 72 88 95 ... 75 96 33 37 26 6 11 15 14 8
Cristiano Ronaldo Portugal 93 93 405000 58500000 34 187 83 84 94 ... 85 95 28 32 24 7 11 15 14 11
Neymar Jr Brazil 92 92 290000 105500000 27 175 68 87 87 ... 90 94 27 26 29 9 9 15 15 11
J. Oblak Slovenia 91 93 125000 77500000 26 188 87 13 11 ... 11 68 27 12 18 87 92 78 90 89
E. Hazard Belgium 91 91 470000 90000000 28 175 74 81 84 ... 88 91 34 27 22 11 12 6 8 8
K. De Bruyne Belgium 91 91 370000 90000000 28 181 70 93 82 ... 79 91 68 58 51 15 13 5 10 13
M. ter Stegen Germany 90 93 250000 67500000 27 187 85 18 14 ... 25 70 25 13 10 88 85 88 88 90
V. van Dijk Netherlands 90 91 200000 78000000 27 193 92 53 52 ... 62 89 91 92 85 13 10 13 11 11
L. Modric Croatia 90 90 340000 45000000 33 172 66 86 72 ... 82 92 68 76 71 13 9 7 14 9
M. Salah Egypt 90 90 240000 80500000 27 175 71 79 90 ... 77 91 38 43 41 14 14 9 11 14

10 rows × 42 columns

Divide the data into explanatory variables X and a target variable y. Here we will predict the market value (value_eur) of the best players.

In [5]:
X = data.drop(["nationality", "overall", "potential", "value_eur", "wage_eur"], axis=1)
y = data['value_eur']

The target variable is right-skewed, so we log-transform it for a better fit.

In [6]:
ylog = np.log(y)

import matplotlib.pyplot as plt
plt.hist(ylog, bins='auto')
plt.title("ln(value_eur)")
plt.show()
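
Since the model will be fitted on the log scale, np.exp inverts the transform back to euros; the custom predict function defined later relies on exactly this. A quick sanity check:

# exponentiation undoes the log transform
assert np.allclose(np.exp(ylog), y)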

Split the data into train and test.

In [7]:
X_train, X_test, ylog_train, ylog_test, y_train, y_test = train_test_split(X, ylog, y, test_size=0.25, random_state=4)
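
With test_size=0.25, the 5000 players split into 3750 training and 1250 test rows (37 explanatory variables), which matches the explainer output below.

# sanity check of the 75/25 split
X_train.shape, X_test.shape  # ((3750, 37), (1250, 37))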

create a default boosting model

In [8]:
gbm_default = LGBMRegressor()

gbm_default.fit(X_train, ylog_train, verbose=False)
Out[8]:
LGBMRegressor()

create a tuned model

In [9]:
gbm_default._estimator_type
Out[9]:
'regressor'
In [10]:
#:# hp tuning
estimator = LGBMRegressor(n_jobs=-1)
param_test = {
    'n_estimators': list(range(201, 1202, 50)),
    'num_leaves': list(range(6, 42, 5)),
    'min_child_weight': [1e-3, 1e-2, 1e-1, 15e-2],
    'learning_rate': [1e-3, 1e-2, 1e-1, 15e-2]
}

rs = RandomizedSearchCV(
    estimator=estimator, 
    param_distributions=param_test, 
    n_iter=100,
    cv=4,
    random_state=1
)

# rs.fit(X, ylog)
# print('Best score reached: {} with params: {} '.format(rs.best_score_, rs.best_params_))
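
Note that the commented-out search would fit on the full X, i.e. including the test rows. A leakage-free variant (a sketch, equally long to run) would restrict the search to the training split:

# rs.fit(X_train, ylog_train)
# print('Best score reached: {} with params: {}'.format(rs.best_score_, rs.best_params_))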
In [11]:
#:# best parameters after 100 iterations
best_params = {'num_leaves': 6, 'n_estimators': 951, 'min_child_weight': 0.1, 'learning_rate': 0.15}
In [12]:
gbm_tuned = LGBMRegressor(**best_params)
gbm_tuned.fit(X_train, ylog_train)
Out[12]:
LGBMRegressor(learning_rate=0.15, min_child_weight=0.1, n_estimators=951,
              num_leaves=6)

create explainers for the models

We want the explanations to show the target variable on its original scale (not the log scale). Therefore, we need to define a custom predict_function.

In [13]:
def predict_function(model, data):
    # the model predicts ln(value_eur), so convert back to euros
    return np.exp(model.predict(data))
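
A minimal check, assuming the models fitted above: the raw model predicts ln(value_eur), while predict_function returns euros.

raw = gbm_tuned.predict(X_test.iloc[:3])            # predictions in ln(value_eur)
eur = predict_function(gbm_tuned, X_test.iloc[:3])  # predictions in euros
# eur equals np.exp(raw), i.e. the original euro scale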
In [14]:
exp_default = dx.Explainer(gbm_default, X_test, y_test, predict_function=predict_function, label='default')
exp_tuned = dx.Explainer(gbm_tuned, X_test, y_test, predict_function=predict_function, label='tuned')
Preparation of a new explainer is initiated

  -> data              : 1250 rows 37 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1250 values
  -> model_class       : lightgbm.sklearn.LGBMRegressor (default)
  -> label             : default
  -> predict function  : <function predict_function at 0x0000024585867700> will be used
  -> model type        : regression will be used (default)
  -> predicted values  : min = 3.57e+05, mean = 7.12e+06, max = 8.12e+07
  -> predict function  : accepts pandas.DataFrame and numpy.ndarray
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1e+07, mean = 2.12e+05, max = 2.43e+07
  -> model_info        : package lightgbm

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 1250 rows 37 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1250 values
  -> model_class       : lightgbm.sklearn.LGBMRegressor (default)
  -> label             : tuned
  -> predict function  : <function predict_function at 0x0000024585867700> will be used
  -> model type        : regression will be used (default)
  -> predicted values  : min = 3.56e+05, mean = 7.12e+06, max = 9.51e+07
  -> predict function  : accepts pandas.DataFrame and numpy.ndarray
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.49e+07, mean = 2.11e+05, max = 2.41e+07
  -> model_info        : package lightgbm

A new explainer has been created!

dalex functions

(figure: overview of the dalex explanation functions)

The functions shown above are accessible from the Explainer object through its methods.

Each of them returns a new object that contains a result field (a pandas.DataFrame) and a plot method.
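
For instance, permutation variable importance follows the same pattern (a sketch using the model_parts method of dalex 0.2.x):

vi_default = exp_default.model_parts()
vi_tuned = exp_tuned.model_parts()
vi_default.result.head()   # a pandas.DataFrame with drop-out losses
vi_default.plot(vi_tuned)  # pass another object to overlay both models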

In [15]:
mp_default = exp_default.model_performance("regression")
mp_default.result
Out[15]:
            mse          rmse        r2           mae            mad
0  5.727888e+12  2.393301e+06  0.923228  1.209461e+06  651771.861674
In [16]:
mp_tuned = exp_tuned.model_performance("regression")
mp_tuned.result
Out[16]:
            mse          rmse        r2           mae          mad
0  4.117085e+12  2.029060e+06  0.944818  1.092173e+06  595137.6644
In [17]:
mp_default.plot(mp_tuned)
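
The plot overlays both models in one chart. Numerically, the tables above already quantify the gain from tuning, e.g. the relative RMSE improvement:

# the tuned model lowers RMSE by about 15% relative to the default
1 - mp_tuned.result.rmse[0] / mp_default.result.rmse[0]  # ~0.15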