Titanic example for xgboost

Imports

In [1]:
import dalex as dx

import pandas as pd
import numpy as np

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

import xgboost as xgb

import warnings
warnings.filterwarnings('ignore')
In [2]:
dx.__version__
Out[2]:
'0.2.1'

Basic example with xgb.train

We will consider the most basic example first. Since the dataset contains categorical variables, which xgboost does not handle by default, we drop them in this first example.

In [3]:
data = dx.datasets.load_titanic()

X = data.drop(columns='survived').loc[:, ['age', 'fare', 'sibsp', 'parch']]
y = data.survived
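
Dropping the categorical columns discards information. As an alternative, they could be one-hot encoded first, e.g. with pandas.get_dummies; a minimal sketch on a toy frame (the column names below are illustrative, not taken from the titanic data):

```python
import pandas as pd

# toy frame with one numeric and one categorical column
df = pd.DataFrame({
    'age': [22.0, 38.0, 26.0],
    'gender': ['male', 'female', 'female']
})

# one-hot encode the categorical column so xgboost can consume the matrix
encoded = pd.get_dummies(df, columns=['gender'])
print(encoded.columns.tolist())  # ['age', 'gender_female', 'gender_male']
```

The encoded frame contains only numeric columns and can be passed to xgb.DMatrix directly.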
In [4]:
params = {
    "max_depth": 5,
    "objective": "binary:logistic",
    "eval_metric": "auc"
}

train = xgb.DMatrix(X, label=y)
classifier = xgb.train(params, train, verbose_eval=1)

Note that although xgboost requires its own xgb.DMatrix data format, you pass a pandas.DataFrame to the Explainer. Use a pandas.DataFrame in all interactions with the Explainer.

In [5]:
exp = dx.Explainer(classifier, X, y)
Preparation of a new explainer is initiated

  -> data              : 2207 rows 4 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : xgboost.core.Booster (default)
  -> label             : not specified, model's class short name is taken instead (default)
  -> predict function  : <function yhat_xgboost at 0x00000266AE045F70> will be used (default)
  -> model type        : model_type not provided and cannot be extracted
  -> model type        : some functionalities won't be available
  -> predicted values  : min = 0.0766, mean = 0.328, max = 0.881
  -> predict function  : accepts only pandas.DataFrame, numpy.ndarray causes problems
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.83, mean = -0.006, max = 0.847
  -> model_info        : package xgboost

A new explainer has been created!

Again, note that X is just a pandas.DataFrame.

In [6]:
exp.predict(X)
Out[6]:
array([0.2206684 , 0.6000112 , 0.4254955 , ..., 0.26790908, 0.23142426,
       0.25531068], dtype=float32)
In [7]:
exp.model_parts().plot()