FIFA 20: explain default vs tuned model with dalex¶
imports¶
import dalex as dx
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings('ignore')
import plotly
plotly.offline.init_notebook_mode()
dx.__version__
load data¶
Load fifa, the preprocessed players_20 dataset. It contains the 5000 best players by 'overall' rating and 43 columns. These are:
- short_name (index)
- nationality of the player (not used in modeling)
- overall, potential, value_eur, wage_eur (4 potential target variables)
- age, height, weight, attacking skills, defending skills, goalkeeping skills (37 variables)
It is advised to leave only one target variable for modeling.
data = dx.datasets.load_fifa()
data.head(10)
Divide the data into variables X and a target variable y. Here we will be predicting the value of the best players.
X = data.drop(["nationality", "overall", "potential", "value_eur", "wage_eur"], axis = 1)
y = data['value_eur']
The target variable is skewed so we transform it with log for a better fit.
ylog = np.log(y)
import matplotlib.pyplot as plt
plt.hist(ylog, bins='auto')
plt.title("ln(value_eur)")
plt.show()
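To see why the log helps, here is a minimal sketch on made-up values (not the FIFA data) that uses only the standard library. The `skewness` helper is a hypothetical name introduced for illustration; it computes the moment-based sample skewness before and after the transform.

```python
import math

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by sd**3."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

# Made-up right-skewed "market values" in EUR (illustrative only).
values = [1e5, 2e5, 3e5, 5e5, 8e5, 1.5e6, 4e6, 1e7, 3e7, 1e8]
log_values = [math.log(v) for v in values]

skew_raw = skewness(values)
skew_log = skewness(log_values)
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")

# The transform is invertible, so predictions can be mapped back with exp.
round_trip = [math.exp(v) for v in log_values]
```

Because the transform is invertible, a model fitted on the log scale can still report values in EUR once its predictions are passed through exp.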
Split the data into train and test.
X_train, X_test, ylog_train, ylog_test, y_train, y_test = \
    train_test_split(X, ylog, y, test_size=0.25, random_state=4)
create a default boosting model¶
gbm_default = LGBMRegressor()
gbm_default.fit(X_train, ylog_train)
create a tuned model¶
gbm_default._estimator_type
#:# hp tuning
estimator = LGBMRegressor(n_jobs = -1)
param_test = {
    'n_estimators': list(range(201,1202,50)),
    'num_leaves': list(range(6, 42, 5)),
    'min_child_weight': [1e-3, 1e-2, 1e-1, 15e-2],
    'learning_rate': [1e-3, 1e-2, 1e-1, 15e-2]
}
rs = RandomizedSearchCV(
    estimator=estimator,
    param_distributions=param_test,
    n_iter=100,
    cv=4,
    random_state=1
)
# rs.fit(X, ylog)
# print('Best score reached: {} with params: {} '.format(rs.best_score_, rs.best_params_))
#:# best parameters after 100 iterations
best_params = {'num_leaves': 6,
'n_estimators': 951,
'min_child_weight': 0.1,
'learning_rate': 0.15}
gbm_tuned = LGBMRegressor(**best_params)
gbm_tuned.fit(X_train, ylog_train)
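For a sense of how much of the search space 100 iterations cover, the sketch below counts the combinations in `param_test` and draws candidates with the standard library only. The sampling loop is a simplification: when every parameter is given as a list, scikit-learn samples the grid without replacement, while this sketch draws each parameter independently.

```python
import random

# Same search space as param_test in the cell above.
param_test = {
    "n_estimators": list(range(201, 1202, 50)),
    "num_leaves": list(range(6, 42, 5)),
    "min_child_weight": [1e-3, 1e-2, 1e-1, 15e-2],
    "learning_rate": [1e-3, 1e-2, 1e-1, 15e-2],
}

# Size of the full grid: product of the number of options per parameter.
grid_size = 1
for options in param_test.values():
    grid_size *= len(options)

# Simplified randomized search: one independent random choice per parameter
# and per iteration (an assumption for illustration, see the note above).
rng = random.Random(1)
n_iter = 100
candidates = [
    {name: rng.choice(options) for name, options in param_test.items()}
    for _ in range(n_iter)
]

# The parameters reported in the notebook should lie on this grid.
best_params = {"num_leaves": 6, "n_estimators": 951,
               "min_child_weight": 0.1, "learning_rate": 0.15}
in_grid = all(best_params[k] in param_test[k] for k in best_params)

print(f"grid size: {grid_size}, sampled: {n_iter} "
      f"({100 * n_iter / grid_size:.1f}% of the grid)")
print("reported best_params lie on the grid:", in_grid)
```

The search therefore evaluates only a small fraction of the grid, which is the usual trade-off of a randomized search against an exhaustive one.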
create explainers for the models¶
We aim to see real values of the target variable in the explanations (not log). Therefore, we need to make a custom predict_function.
def predict_function(model, data):
    return np.exp(model.predict(data))
exp_default = dx.Explainer(gbm_default, X_test, y_test,
predict_function=predict_function, label='default')
exp_tuned = dx.Explainer(gbm_tuned, X_test, y_test,
predict_function=predict_function, label='tuned')
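The effect of the custom predict function can be sketched without dalex or LightGBM. `LogScaleStub` below is a hypothetical stand-in for a regressor trained on the log of the target, and the `rating` feature is made up; the sketch uses `math.exp` on lists where the notebook uses `np.exp` on arrays.

```python
import math

class LogScaleStub:
    """Hypothetical stand-in for a regressor trained on ln(value_eur)."""

    def predict(self, rows):
        # Pretend the fitted model is ln(value) = 10 + 0.05 * rating.
        return [10 + 0.05 * row["rating"] for row in rows]

def predict_function(model, data):
    # Same idea as in the notebook: undo the log so outputs are in EUR.
    return [math.exp(p) for p in model.predict(data)]

rows = [{"rating": 70}, {"rating": 90}]
model = LogScaleStub()
log_preds = model.predict(rows)
eur_preds = predict_function(model, rows)
print("log scale:", log_preds)
print("EUR scale:", [round(p) for p in eur_preds])
```

This is also why the explainers receive y_test rather than ylog_test: residuals only make sense when the target and the predictions are on the same scale.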
introduction to the topic: Explanatory Model Analysis: Explore, Explain, and Examine Predictive Models¶
The above functionalities are accessible from the Explainer object through its methods.
Model-level and predict-level methods return a new unique object that contains the result attribute (pandas.DataFrame) and the plot method.
mp_default = exp_default.model_performance("regression")
mp_default.result
mp_tuned = exp_tuned.model_performance("regression")
mp_tuned.result
mp_default.plot(mp_tuned)
These are very large values (the errors are measured in EUR), so the difference between the models may look subtle on paper.
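To make the scale issue concrete, here is a hand-rolled sketch of common regression measures on made-up numbers. The `regression_metrics` helper and both prediction vectors are hypothetical; they are not output of the models above.

```python
import math

def regression_metrics(y_true, y_pred):
    """Hand-rolled MSE, RMSE, MAE and R^2 for illustration."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Made-up player values and two made-up sets of predictions, in EUR.
y_true = [2_000_000, 5_000_000, 12_000_000, 40_000_000]
pred_a = [2_500_000, 4_000_000, 13_000_000, 36_000_000]
pred_b = [2_200_000, 4_600_000, 12_500_000, 38_000_000]

metrics_a = regression_metrics(y_true, pred_a)
metrics_b = regression_metrics(y_true, pred_b)
for name, metrics in (("a", metrics_a), ("b", metrics_b)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

MSE is expressed in squared EUR and becomes enormous, while R^2 is scale-free; comparing models on RMSE or R^2 is usually easier to read than comparing raw MSE.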
What are the differences between these two models? Let's find out.
Customize the computation with parameters:
- loss_function: function to use for drop-out loss evaluation
- B: number of bootstrap rounds (e.g. 15 for slower computation but more stable results)
- N: number of observations to use (e.g. 500 for faster computation but less stable results)
- variable_groups: dict of lists of variables. Each list is treated as one group. This is for testing joint variable importance
X.columns
variable_groups = {
'age': ['age'],
'body': ['height_cm', 'weight_kg'],
'attacking': ['attacking_crossing',
'attacking_finishing', 'attacking_heading_accuracy',
'attacking_short_passing', 'attacking_volleys'],
'skill': ['skill_dribbling',
'skill_curve', 'skill_fk_accuracy', 'skill_long_passing',
'skill_ball_control'],
'movement': ['movement_acceleration', 'movement_sprint_speed',
'movement_agility', 'movement_reactions', 'movement_balance'],
'power': ['power_shot_power', 'power_jumping', 'power_stamina', 'power_strength',
'power_long_shots'],
'mentality': ['mentality_aggression', 'mentality_interceptions',
'mentality_positioning', 'mentality_vision', 'mentality_penalties',
'mentality_composure'],
'defending': ['defending_marking', 'defending_standing_tackle',
'defending_sliding_tackle'],
'goalkeeping' : ['goalkeeping_diving',
'goalkeeping_handling', 'goalkeeping_kicking',
'goalkeeping_positioning', 'goalkeeping_reflexes']
}
vi_default = exp_default.model_parts(variable_groups=variable_groups, B=15, random_state=0)
vi_tuned = exp_tuned.model_parts(variable_groups=variable_groups, B=15, random_state=0)
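The idea behind grouped permutation importance can be sketched in plain Python. Everything below is an illustrative assumption: `toy_model`, the three made-up features and the `permutation_importance` helper are hypothetical, and the sketch is not how dalex implements `model_parts` internally. It only shows the drop-out loss, the role of B and the joint shuffling of a group.

```python
import math
import random

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def toy_model(row):
    # Hypothetical model: uses a and b, ignores c entirely.
    return 3.0 * row["a"] + 1.0 * row["b"]

def permutation_importance(rows, y, groups, B=15, seed=0):
    """Mean increase in RMSE after jointly shuffling each group's columns."""
    rng = random.Random(seed)
    base_loss = rmse(y, [toy_model(r) for r in rows])
    result = {}
    for group_name, columns in groups.items():
        losses = []
        for _ in range(B):
            # One shared row order per round keeps a group's columns together.
            order = list(range(len(rows)))
            rng.shuffle(order)
            permuted = []
            for i, row in enumerate(rows):
                new_row = dict(row)
                for col in columns:
                    new_row[col] = rows[order[i]][col]
                permuted.append(new_row)
            losses.append(rmse(y, [toy_model(r) for r in permuted]))
        result[group_name] = sum(losses) / B - base_loss
    return result

# Made-up data: three independent features and a slightly noisy target.
data_rng = random.Random(42)
rows = [{name: data_rng.gauss(0, 1) for name in ("a", "b", "c")} for _ in range(200)]
y = [toy_model(r) + data_rng.gauss(0, 0.1) for r in rows]

importance = permutation_importance(
    rows, y, groups={"a": ["a"], "b": ["b"], "c": ["c"], "a+b": ["a", "b"]})
print({k: round(v, 3) for k, v in importance.items()})
```

A variable the model never uses gets a drop-out loss of zero, and averaging over B rounds smooths out the randomness of individual shuffles.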
Customize the plot with parameters:
- vertical_spacing: value between 0.0 and 1.0 (e.g. 0.15 for more space between the plots)
- rounding_function: rounds the contributions (e.g. np.round, np.rint, np.ceil)
- digits: (e.g. 2 for np.round, None for np.rint)
vi_default.plot(vi_tuned,
max_vars=6, rounding_function=np.rint, digits=None, vertical_spacing=0.15)
Variables connected with body and power aren't important for these models. The same is true for goalkeeping. This might mean that predictions for goalkeepers aren't accurate. The most important factors in predicting a player's value are skill, attacking and movement.
It seems that the default model focuses too much on movement variables and finds other variables less important, especially skill. The tuned model finds mentality and defending quite important. Next, we will examine these variables more closely.
Aggregated Profiles¶
Choose a proper algorithm. The explanations can be calculated as Partial Dependence Profile or Accumulated Local Dependence Profile.
The key parameter is N, the number of observations to use (e.g. 800 for slower computation but more stable results).
Here we will use ale plots, which work better if the explanatory variables are correlated.
ale_default = exp_default.model_profile(type = 'accumulated', N=800, label='ale-default')
ale_tuned = exp_tuned.model_profile(type = 'accumulated', N=800, label='ale-tuned')
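Why accumulated profiles are preferred for correlated variables can be illustrated on a toy case. The sketch below is a simplified, hypothetical construction (made-up data, a hand-written `toy_model`, quantile bins and a naive centering step), not the dalex implementation. The model multiplies two features that are almost identical in the data.

```python
import random

def toy_model(x1, x2):
    # Hypothetical model with an interaction between two features.
    return x1 * x2

rng = random.Random(0)
x1_values = [rng.gauss(0, 1) for _ in range(300)]
# x2 is almost a copy of x1, so the two features are strongly correlated.
data = [(x1, x1 + rng.gauss(0, 0.1)) for x1 in x1_values]

n_bins = 6
x1_sorted = sorted(x1 for x1, _ in data)
# Bin edges at empirical quantiles of x1 (both profiles use them as the grid).
edges = [x1_sorted[round(k * (len(data) - 1) / n_bins)] for k in range(n_bins + 1)]

# Partial dependence: set x1 to a grid value for ALL rows, average predictions.
pdp = [sum(toy_model(g, x2) for _, x2 in data) / len(data) for g in edges]

# Accumulated local effects: within each bin, average the prediction change
# between the bin's edges using only rows that fall in that bin, then accumulate.
ale = [0.0]
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = [x2 for x1, x2 in data if lo < x1 <= hi or x1 == lo == edges[0]]
    local = sum(toy_model(hi, x2) - toy_model(lo, x2) for x2 in in_bin) / len(in_bin)
    ale.append(ale[-1] + local)
# Simplified centering: subtract the mean over grid points.
ale_mean = sum(ale) / len(ale)
ale = [a - ale_mean for a in ale]

pdp_range = max(pdp) - min(pdp)
ale_range = max(ale) - min(ale)
print(f"PDP range: {pdp_range:.2f}, ALE range: {ale_range:.2f}")
```

The partial dependence profile averages over combinations of x1 and x2 that never occur together and comes out nearly flat, while the accumulated profile follows the U-shaped response the model actually produces on realistic data.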
ale_default.plot(ale_tuned, variables = ['goalkeeping_positioning', 'power_stamina',
'mentality_vision', 'defending_marking',
'attacking_finishing', 'attacking_heading_accuracy',
'attacking_short_passing', 'skill_ball_control'])
Overall, we can see that the tuned model is using more variables. Examples are defending_marking, goalkeeping_positioning, mentality_vision, power_stamina, skill_ball_control and attacking variables.
It also acts differently with variables like age and movement_reactions.
ale_default.plot(ale_tuned, variables = ['age', 'movement_reactions'])
Variable Attribution¶
Choose a proper algorithm. The explanations can be calculated as Break Down, iBreakDown or Shapley Values.
For type='shap' the key parameter is B, the number of bootstrap rounds (e.g. 10 for faster computation but less stable results).
Let's find out what contributes to the value of the best players.
va = {'ibd':[], 'sh':[]}
for name in data.index[0:3]:
    player = X.loc[name,]
    ibd = exp_tuned.predict_parts(player, type='break_down_interactions', label=name)
    sh = exp_tuned.predict_parts(player, type='shap', B=10, label=name)
    va['ibd'].append(ibd)
    va['sh'].append(sh)
va['ibd'][0].plot(va['ibd'][1:3],
rounding_function=lambda x, digits: np.rint(x, digits).astype(int),
digits=None, max_vars=10)
va['sh'][0].plot(va['sh'][1:3],
rounding_function=lambda x, digits: np.rint(x, digits).astype(int),
digits=None, max_vars=10)
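The relation between Break Down and Shapley values can be sketched on a toy example. All names and numbers below are hypothetical (`toy_model`, the three features, the background rows and the player). The sketch enumerates every ordering, which is feasible only for a handful of variables; sampling a limited number of random orderings, as the B parameter suggests, is the practical alternative.

```python
from itertools import permutations

def toy_model(row):
    # Hypothetical model with an interaction between skill and pace.
    return 10.0 + 2.0 * row["skill"] - 0.5 * row["age"] + 0.1 * row["skill"] * row["pace"]

FEATURES = ["age", "skill", "pace"]

# Made-up background data (plays the role of the explainer's data).
background = [
    {"age": 20, "skill": 60, "pace": 70},
    {"age": 25, "skill": 75, "pace": 80},
    {"age": 30, "skill": 85, "pace": 65},
    {"age": 35, "skill": 70, "pace": 50},
]
player = {"age": 27, "skill": 90, "pace": 85}

def mean_prediction(fixed):
    """Average prediction with the `fixed` features set to the player's values."""
    return sum(toy_model({**row, **fixed}) for row in background) / len(background)

def break_down(order):
    """Contributions along one ordering: fix features one by one, record the change."""
    fixed, contributions = {}, {}
    previous = mean_prediction(fixed)
    for name in order:
        fixed[name] = player[name]
        current = mean_prediction(fixed)
        contributions[name] = current - previous
        previous = current
    return contributions

# Shapley values: average the Break Down contributions over all orderings.
orderings = list(permutations(FEATURES))
shapley = {name: sum(break_down(o)[name] for o in orderings) / len(orderings)
           for name in FEATURES}

baseline = mean_prediction({})
prediction = toy_model(player)
print("one ordering:", break_down(FEATURES))
print("shapley:     ", shapley)
```

In both cases the contributions add up to the difference between the player's prediction and the mean prediction. With interactions, a single ordering depends on which variable is fixed first, which is what averaging over orderings removes.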
Looking at the Break Down plots, the age and movement_reactions variables stand out. Let's focus on them more.
cp = exp_tuned.predict_profile(X.iloc[2:3,],
variables=['age', 'movement_reactions'],
label=X.index[2]) # variables to calculate
cp.plot(size=3, title="What If? Neymar Jr") # larger width of the line and dot size & change title
Here we see how the prediction would change if Neymar Jr was younger/older or had lower movement_reactions.
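A what-if profile of this kind boils down to re-predicting one observation over a grid of values for a single variable. The sketch below assumes a made-up `toy_model` and a made-up player; `ceteris_paribus` is a hypothetical helper, not the dalex API.

```python
def toy_model(row):
    # Hypothetical value model: peaks at age 27, grows with reactions.
    return 100_000 * row["movement_reactions"] - 200_000 * (row["age"] - 27) ** 2

player = {"age": 27, "movement_reactions": 92}  # made-up observation

def ceteris_paribus(observation, variable, grid):
    """Predictions when only `variable` changes and everything else stays fixed."""
    return [(value, toy_model({**observation, variable: value})) for value in grid]

age_profile = ceteris_paribus(player, "age", range(18, 39))
reactions_profile = ceteris_paribus(player, "movement_reactions", range(50, 100, 5))

best_age, best_value = max(age_profile, key=lambda point: point[1])
print("age with the highest predicted value:", best_age)
print("value at movement_reactions=50:", reactions_profile[0][1])
```

Each profile passes through the model's actual prediction at the observed value, which is the dot shown on the plot.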
Hover over all of the above plots for tooltips with more information.
Plots¶
This package uses plotly to render the plots:
- Install extensions to use plotly in JupyterLab: Getting Started, Troubleshooting
- Use the show=False parameter in the plot method to return a plotly Figure object
- It is possible to edit the figures and save them
Resources - https://dalex.drwhy.ai/python¶
Introduction to the dalex package: Titanic: tutorial and examples
Key features explained: FIFA20: explain default vs tuned model with dalex
How to use dalex with: xgboost, tensorflow, h2o (feat. autokeras, catboost, lightgbm)
More explanations: residuals, shap, lime
Introduction to the Fairness module in dalex
Introduction to the Aspect module in dalex
Introduction to Arena: interactive dashboard for model exploration
Code in the form of jupyter notebook
Changelog: NEWS
Theoretical introduction to the plots: Explanatory Model Analysis: Explore, Explain, and Examine Predictive Models