Titanic example for xgboost¶
Introduction to the topic: Explanatory Model Analysis: Explore, Explain, and Examine Predictive Models¶
Imports¶
import dalex as dx
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')
import plotly
plotly.offline.init_notebook_mode()
dx.__version__
Basic example with xgb.train¶
We will consider the most basic example first. Because the dataset contains categorical variables, which xgb does not handle by default, we drop them in this first example.
data = dx.datasets.load_titanic()
X = data.drop(columns='survived').loc[:, ['age', 'fare', 'sibsp', 'parch']]
y = data.survived
params = {
"max_depth": 5,
"objective": "binary:logistic",
"eval_metric": "auc"
}
train = xgb.DMatrix(X, label=y)
classifier = xgb.train(params, train, verbose_eval=1)
Note that despite the special data format required by xgb, you pass a pandas.DataFrame to the Explainer. You have to use a DataFrame in all interactions with the Explainer.
exp = dx.Explainer(classifier, X, y)
Again, note that X is just a DataFrame.
exp.predict(X)
exp.model_parts().plot()
Combining xgboost with sklearn's Pipeline - easy example¶
You can create more complex models with sklearn's Pipeline. This time we will use all of the available data.
Using the xgb.XGBClassifier or xgb.XGBRegressor class in a Pipeline does not differ from using any sklearn classifier.
However, wrapping a model in a Pipeline hides information about its class. Sometimes this has no impact at all, because the interface stays the same, but sometimes you have to pass information about the inner model class (the one that actually makes the prediction) or create your own predict function with an appropriate interface.
data = dx.datasets.load_titanic()
X = data.drop(columns='survived')
y = data.survived
numerical_features = ['age', 'fare', 'sibsp', 'parch']
numerical_transformer = Pipeline(
steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
]
)
categorical_features = ['gender', 'class', 'embarked']
categorical_transformer = Pipeline(
steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
]
)
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
]
)
clf = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', xgb.XGBClassifier())])
clf.fit(X, y)
exp = dx.Explainer(clf, X, y)
exp.model_performance(model_type='classification').plot(geom='roc')
exp.model_parts().plot()
Non-standard use¶
The Explainer needs to know how predictions are made. Its constructor tries to figure this out by itself, but that is not always possible. For example, if your model is wrapped in something unknown with non-standard behaviour, you have to tell the Explainer how to treat it.
The easiest way is to pass the model_class parameter, which tells the Explainer what the model's actual interface is.
Easy (and a bit fake) example¶
This example has no practical use, but it is easy to understand and can serve as a template for more realistic cases.
class Wrapper:
    def __init__(self, model):
        self.model = model

    def predict(self, dmatrix):
        return self.model.predict(dmatrix)
params = {
"max_depth": 5,
"objective": "binary:logistic",
"eval_metric": "auc"
}
data = dx.datasets.load_titanic()
X = data.drop(columns='survived').loc[:, ['age', 'fare', 'sibsp', 'parch']]
y = data.survived
train = xgb.DMatrix(X, label=y)
inner_model = xgb.train(params, train, verbose_eval=1)
wrapped = Wrapper(inner_model)
type(inner_model)
type(wrapped)
Such a model has a non-standard interface because it needs a DMatrix. Note that the class of this model is __main__.Wrapper, so the Explainer will fall back to the default predict method, which causes an error.
exp = dx.Explainer(wrapped, X, y)
However, the inner model (a Booster created by xgb.train) is known to the Explainer, so you can simply inform it. You will get a valid Explainer that can be used normally from now on.
exp = dx.Explainer(wrapped, X, y, model_class='xgboost.core.Booster')
exp.model_profile().plot()
exp.model_info
exp.model_class
Easy (and a bit fake) example 2¶
If your base model is not known to the Explainer, you can pass a special predict_function argument that tells it how to handle the model. We will use exactly the same fake model from the previous example.
class Wrapper:
    def __init__(self, model):
        self.model = model

    def predict(self, dmatrix):
        return self.model.predict(dmatrix)
params = {
"max_depth": 5,
"objective": "binary:logistic",
"eval_metric": "auc"
}
data = dx.datasets.load_titanic()
X = data.drop(columns='survived').loc[:, ['age', 'fare', 'sibsp', 'parch']]
y = data.survived
train = xgb.DMatrix(X, label=y)
inner_model = xgb.train(params, train, verbose_eval=1)
wrapped = Wrapper(inner_model)
exp = dx.Explainer(wrapped, X, y, predict_function=lambda m, d: m.predict(xgb.DMatrix(d)))
exp.predict_parts(X.iloc[44, :]).plot(min_max=[0,1])
exp.model_info
exp.model_class
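Since all predictions now go through the custom predict_function, the Explainer can be used as usual. A small sanity check (nothing here is specific to the wrapper, just standard Explainer methods):
exp.predict(X.iloc[:5, :])  # predictions are routed through the custom predict_function
exp.model_performance(model_type='classification')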
Plots¶
This package uses plotly to render the plots:
- Install extensions to use plotly in JupyterLab: Getting Started, Troubleshooting
- Use the show=False parameter in the plot method to return a plotly Figure object (see the sketch after this list)
- It is possible to edit the figures and save them
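A minimal sketch of this workflow (assuming exp is the Explainer created earlier; the static image export additionally assumes the kaleido package is installed):
fig = exp.model_parts().plot(show=False)  # returns a plotly Figure instead of displaying it
fig.update_layout(title_text="Permutation-based variable importance")  # edit it like any plotly Figure
fig.write_html("model_parts.html")  # save as interactive HTML
# fig.write_image("model_parts.png")  # static export, requires kaleido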
Resources - https://dalex.drwhy.ai/python¶
- Introduction to the dalex package: Titanic: tutorial and examples
- Key features explained: FIFA20: explain default vs tuned model with dalex
- How to use dalex with: xgboost, tensorflow, h2o (feat. autokeras, catboost, lightgbm)
- More explanations: residuals, shap, lime
- Introduction to the Fairness module in dalex
- Introduction to the Aspect module in dalex
- Introduction to Arena: interactive dashboard for model exploration
- Code in the form of jupyter notebook
- Changelog: NEWS
- Theoretical introduction to the plots: Explanatory Model Analysis: Explore, Explain, and Examine Predictive Models