Titanic example for xgboost

Imports

In [1]:
import dalex as dx

import pandas as pd
import numpy as np

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

import xgboost as xgb

import warnings
warnings.filterwarnings('ignore')

import plotly
plotly.offline.init_notebook_mode()
In [2]:
dx.__version__
Out[2]:
'1.7.0'

Basic example with xgb.train

We will consider the most basic example first. Because the dataset contains categorical variables, which xgboost does not handle by default, we drop them in this first example.

In [3]:
data = dx.datasets.load_titanic()

X = data.drop(columns='survived').loc[:, ['age', 'fare', 'sibsp', 'parch']]
y = data.survived
In [4]:
params = {
    "max_depth": 5,
    "objective": "binary:logistic",
    "eval_metric": "auc"
}

train = xgb.DMatrix(X, label=y)
classifier = xgb.train(params, train, verbose_eval=1)

Note that despite the special data format needed by xgboost, you pass a pandas.DataFrame to the Explainer. You have to use a DataFrame in all interactions with the Explainer.
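As a quick check of this distinction, calling the Booster directly still requires a DMatrix, while the Explainer created below accepts the raw DataFrame; a minimal sketch using the objects defined above:

# Direct use of the Booster requires a DMatrix; the Explainer
# performs this conversion for you internally.
classifier.predict(xgb.DMatrix(X))   # works: Booster interface
# classifier.predict(X)              # would raise: X is not a DMatrix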

In [5]:
exp = dx.Explainer(classifier, X, y)
Preparation of a new explainer is initiated

  -> data              : 2207 rows 4 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : xgboost.core.Booster (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_xgboost at 0x1176dc4a0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0498, mean = 0.322, max = 0.893
  -> model type        : 'model_type' not provided and cannot be extracted.
  -> model type        : Some functionalities won't be available.
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.787, mean = -0.000182, max = 0.857
  -> model_info        : package xgboost

A new explainer has been created!

Again, note that X is just a pandas.DataFrame.

In [6]:
exp.predict(X)
Out[6]:
array([0.17432815, 0.63725615, 0.52524376, ..., 0.26590124, 0.21221745,
       0.24626721], dtype=float32)
In [7]:
exp.model_parts().plot()

Combining xgboost with sklearn's Pipeline - easy example

You can create more complex models with sklearn's Pipeline. This time we will use all of the available columns.

Using the xgb.XGBClassifier or xgb.XGBRegressor class in a Pipeline does not differ from using any sklearn classifier.
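As a quick illustration, here is a minimal sketch of xgb.XGBClassifier used standalone with the sklearn estimator API (it assumes the numeric-only X and y from the first example, since raw string columns would need encoding first):

# xgb.XGBClassifier exposes the usual sklearn methods
# (fit / predict / predict_proba), which is why it drops
# into a Pipeline like any other sklearn estimator.
sk_clf = xgb.XGBClassifier(max_depth=5)
sk_clf.fit(X, y)
sk_clf.predict_proba(X)[:5, 1]   # predicted probability of survival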

However, wrapping a model in a Pipeline hides information about its class. Sometimes this has no impact at all, because the interfaces are the same, but sometimes you have to pass information about the inner model's class (the one that actually makes the prediction) or create your own predict function with the appropriate interface; see the sketch after the fitted pipeline below.

In [8]:
data = dx.datasets.load_titanic()

X = data.drop(columns='survived')
y = data.survived
In [9]:
numerical_features = ['age', 'fare', 'sibsp', 'parch']
numerical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)

categorical_features = ['gender', 'class', 'embarked']
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', xgb.XGBClassifier())])
In [10]:
clf.fit(X, y)
Out[10]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fare', 'sibsp',
                                                   'parch']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['gender', 'cl...
                               feature_types=None, gamma=None, grow_policy=None,
                               importance_type=None,
                               interaction_constraints=None, learning_rate=None,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=None, max_leaves=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, multi_strategy=None,
                               n_estimators=None, n_jobs=None,
                               num_parallel_tree=None, random_state=None, ...))])
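Note that the fitted object reports itself as a Pipeline; the inner model's class is only visible if you inspect the final step yourself. A small sketch:

# The Pipeline hides the inner model's class from the Explainer:
print(type(clf).__name__)                            # Pipeline
print(type(clf.named_steps['classifier']).__name__)  # XGBClassifier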
In [11]:
exp = dx.Explainer(clf, X, y)
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x1176dc400> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.000258, mean = 0.322, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.756, mean = 0.000108, max = 0.967
  -> model_info        : package sklearn

A new explainer has been created!
In [12]:
exp.model_performance(model_type='classification').plot(geom='roc')
In [13]:
exp.model_parts().plot()

Non-standard use

The Explainer needs to know how predictions are made. Its constructor tries to find this out by itself, but that is not always possible. For example, if your model is wrapped with something unknown that behaves in a non-standard way, you have to tell the Explainer how to treat it.

The easiest way is to pass the model_class parameter, which tells the Explainer what the model's actual interface is.

Easy (and a bit fake) example

This example has no practical use, but it is easy to understand and can serve as a template for more practical cases.

In [14]:
class Wrapper:
    def __init__(self, model):
        self.model = model
        
    def predict(self, dmatrix):
        return self.model.predict(dmatrix)
In [15]:
params = {
    "max_depth": 5,
    "objective": "binary:logistic",
    "eval_metric": "auc"
}

data = dx.datasets.load_titanic()

X = data.drop(columns='survived').loc[:, ['age', 'fare', 'sibsp', 'parch']]
y = data.survived

train = xgb.DMatrix(X, label=y)
inner_model = xgb.train(params, train, verbose_eval=1)

wrapped = Wrapper(inner_model)
In [16]:
type(inner_model)
Out[16]:
xgboost.core.Booster
In [17]:
type(wrapped)
Out[17]:
__main__.Wrapper

Such a model has a non-standard interface because it needs a DMatrix. Note that the class of this model is __main__.Wrapper, so the Explainer will use the default predict method, which causes an error.

In [18]:
exp = dx.Explainer(wrapped, X, y)
Preparation of a new explainer is initiated

  -> data              : 2207 rows 4 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : __main__.Wrapper (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x1176dc360> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         :  'residual_function' returns an Error when executed:
('Expecting data to be a DMatrix object, got: ', <class 'pandas.core.frame.DataFrame'>)
  -> model_info        : package __main__

A new explainer has been created!

However, the inner model (an xgboost Booster created with xgb.train) is known to the Explainer, so you can simply inform it via model_class. You will get a valid explainer that can be used normally from now on.
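If you are unsure which string to pass, you can derive the fully qualified class name from the inner model itself; a small sketch:

# Build the string expected by model_class from the wrapped model:
inner = wrapped.model
print(type(inner).__module__ + '.' + type(inner).__name__)  # 'xgboost.core.Booster'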

In [19]:
exp = dx.Explainer(wrapped, X, y, model_class='xgboost.core.Booster')
Preparation of a new explainer is initiated

  -> data              : 2207 rows 4 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : xgboost.core.Booster
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_xgboost at 0x1176dc4a0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0498, mean = 0.322, max = 0.893
  -> model type        : 'model_type' not provided and cannot be extracted.
  -> model type        : Some functionalities won't be available.
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.787, mean = -0.000182, max = 0.857
  -> model_info        : package __main__

A new explainer has been created!
In [20]:
exp.model_profile().plot()
Calculating ceteris paribus: 100%|██████████| 4/4 [00:00<00:00, 245.14it/s]
In [21]:
exp.model_info
Out[21]:
{'model_package': '__main__',
 'model_class_default': False,
 'label_default': True,
 'predict_function_default': True,
 'arrays_accepted': False,
 'residual_function_default': True}
In [22]:
exp.model_class
Out[22]:
'xgboost.core.Booster'

Easy (and a bit fake) example 2

If your base model is not known to the Explainer, you can pass the special argument predict_function to tell it how to obtain predictions. We will use exactly the same fake model as in the previous example.

In [23]:
class Wrapper:
    def __init__(self, model):
        self.model = model
        
    def predict(self, dmatrix):
        return self.model.predict(dmatrix)
In [24]:
params = {
    "max_depth": 5,
    "objective": "binary:logistic",
    "eval_metric": "auc"
}

data = dx.datasets.load_titanic()

X = data.drop(columns='survived').loc[:, ['age', 'fare', 'sibsp', 'parch']]
y = data.survived

train = xgb.DMatrix(X, label=y)
inner_model = xgb.train(params, train, verbose_eval=1)

wrapped = Wrapper(inner_model)
In [25]:
exp = dx.Explainer(wrapped, X, y, predict_function=lambda m, d: m.predict(xgb.DMatrix(d)))
Preparation of a new explainer is initiated

  -> data              : 2207 rows 4 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : __main__.Wrapper (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function <lambda> at 0x177c04040> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0498, mean = 0.322, max = 0.893
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.787, mean = -0.000182, max = 0.857
  -> model_info        : package __main__

A new explainer has been created!
In [26]:
exp.predict_parts(X.iloc[44, :]).plot(min_max=[0,1])
In [27]:
exp.model_info
Out[27]:
{'model_package': '__main__',
 'Pipeline': False,
 'model_class_default': True,
 'label_default': True,
 'predict_function_default': False,
 'arrays_accepted': False,
 'model_type_default': True,
 'residual_function_default': True}
In [28]:
exp.model_class
Out[28]:
'__main__.Wrapper'

Plots

This package uses plotly to render the plots.
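If you want to post-process or save a plot, most dalex plot methods accept show=False and then return the underlying plotly figure; a small sketch, assuming the exp object from above:

# With show=False the figure is returned instead of displayed,
# so it can be styled or written to a standalone HTML file.
fig = exp.model_parts().plot(show=False)
fig.write_html("model_parts.html")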

Resources - https://dalex.drwhy.ai/python