dalex - new features

imports

In [1]:
import dalex as dx

import numpy as np
import pandas as pd

from lightgbm import LGBMRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')
In [2]:
dx.__version__
Out[2]:
'0.2.1'

prepare data

Transform the skewed target variable (y) for better model fit.

In [3]:
data = dx.datasets.load_fifa()
X = data.drop(["nationality", "overall", "potential", "value_eur", "wage_eur"], axis = 1)
y = data['value_eur']

ylog = np.log(y)
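
As a rough illustration of why the log transform helps, the sketch below measures skewness before and after taking the log. It uses synthetic log-normal values as a hypothetical stand-in for value_eur (not the FIFA data), and the skewness helper is written here for illustration only.

```python
import numpy as np

# Hypothetical right-skewed values standing in for value_eur (not the FIFA data).
rng = np.random.default_rng(0)
y_demo = rng.lognormal(mean=14.0, sigma=1.2, size=5000)


def skewness(a):
    """Sample skewness: the third standardized moment."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return (centered ** 3).mean() / a.std() ** 3


ylog_demo = np.log(y_demo)

skew_raw = skewness(y_demo)
skew_log = skewness(ylog_demo)
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")

# np.exp inverts np.log, so log-scale predictions can be mapped back later.
roundtrip_ok = np.allclose(np.exp(ylog_demo), y_demo)
print("round trip ok:", roundtrip_ok)
```

The round-trip check at the end is what makes it possible to report predictions on the original scale further down.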

create models

Use a Pipeline to standardize the features before they reach the SVR.

In [4]:
model_svm = Pipeline(steps=[('scale', StandardScaler()),
                            ('model', SVR(C=10, epsilon=0.2, tol=1e-4))])
model_svm.fit(X, ylog)
Out[4]:
Pipeline(steps=[('scale', StandardScaler()),
                ('model', SVR(C=10, epsilon=0.2, tol=0.0001))])
In [5]:
model_gbm = LGBMRegressor(n_estimators=200, max_depth=10, learning_rate=0.15, random_state=0)
model_gbm.fit(X, ylog)
Out[5]:
LGBMRegressor(learning_rate=0.15, max_depth=10, n_estimators=200,
              random_state=0)

predict_function

Because we transformed the target, we replace the default predict_function with one that returns y on its original scale.

In [6]:
def predict_function(model, data):
    return np.exp(model.predict(data))
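
To see what this function does in isolation, here is a minimal sketch with a hypothetical stand-in model (LogScaleModel is made up for this example and is not part of dalex or scikit-learn). The model predicts on the log scale, and predict_function maps the output back with np.exp.

```python
import numpy as np


class LogScaleModel:
    """Hypothetical stand-in for a model that was fitted on log(y)."""

    def predict(self, data):
        # Pretend the fitted relation is log(y) = 10 + 0.5 * x0.
        data = np.asarray(data, dtype=float)
        return 10.0 + 0.5 * data[:, 0]


def predict_function(model, data):
    # Same shape as the notebook's function: undo the log transform.
    return np.exp(model.predict(data))


X_demo = np.array([[0.0], [2.0], [4.0]])
model = LogScaleModel()

log_pred = model.predict(X_demo)               # log scale
real_pred = predict_function(model, X_demo)    # original scale

print(log_pred)
print(real_pred)
```

Returning predictions on the original scale matters because the explainer below receives y, not ylog, and its residuals are the difference between y and the predictions.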

create an explainer for the model

Explainer prints useful information, especially for resolving potential errors.

In [7]:
exp_svm = dx.Explainer(model_svm, data=X, y=y, predict_function=predict_function, label='svm')
exp_gbm = dx.Explainer(model_gbm, data=X, y=y, predict_function=predict_function, label='gbm')
Preparation of a new explainer is initiated

  -> data              : 5000 rows 37 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 5000 values
  -> model_class       : sklearn.svm._classes.SVR (default)
  -> label             : svm
  -> predict function  : <function predict_function at 0x0000020941360A60> will be used
  -> model type        : regression will be used (default)
  -> predicted values  : min = 2.2e+05, mean = 7.25e+06, max = 8.69e+07
  -> predict function  : accepts pandas.DataFrame and numpy.ndarray
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.53e+07, mean = 2.19e+05, max = 1.86e+07
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 5000 rows 37 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 5000 values
  -> model_class       : lightgbm.sklearn.LGBMRegressor (default)
  -> label             : gbm
  -> predict function  : <function predict_function at 0x0000020941360A60> will be used
  -> model type        : regression will be used (default)
  -> predicted values  : min = 2.01e+05, mean = 7.43e+06, max = 1.04e+08
  -> predict function  : accepts pandas.DataFrame and numpy.ndarray
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -6e+06, mean = 4.17e+04, max = 8.85e+06
  -> model_info        : package lightgbm

A new explainer has been created!

model_performance allows for easy model comparison.

In [8]:
pd.concat((exp_svm.model_performance().result, exp_gbm.model_performance().result))
Out[8]:
             mse          rmse        r2            mae            mad
svm 2.907931e+12  1.705266e+06  0.963016  950656.413963  531411.484924
gbm 4.142691e+11  6.436374e+05  0.994731  357519.691396  203774.166420
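
For reference, the columns above can be reproduced by hand. The sketch below is not dalex code: regression_measures is a hypothetical helper, and it assumes that mad means the median absolute residual.

```python
import numpy as np


def regression_measures(y_true, y_pred):
    """Compute the five measures from the table on plain arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "r2": 1.0 - np.sum(resid ** 2) / ss_total,
        "mae": np.mean(np.abs(resid)),
        "mad": np.median(np.abs(resid)),  # assumption: median absolute residual
    }


# Tiny toy data: three exact predictions and one that is off by 2.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
measures = regression_measures(y_true, y_pred)
print(measures)
```

On the toy data a single large error moves mse and mae but leaves mad at zero, which is why it helps to read the columns side by side.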

dalex functions

[Figure: overview of dalex functions]

The functions above, and more, are available as methods of the Explainer object.

Each method returns a new object that has a result field (a pandas.DataFrame) and a plot method.

New features

shap wrapper

predict_parts and model_parts accept a new type='shap_wrapper', which uses the shap package to compute SHAP value explanations.

In [9]:
pp = exp_gbm.predict_parts(X.iloc[[1]], type='shap_wrapper', shap_explainer_type="TreeExplainer")
type(pp)
Out[9]:
dalex.wrappers._shap.object.ShapWrapper
In [10]:
pp.plot()
In [11]:
pp.result  # shap_values
Out[11]:
array([[-7.54586630e-01,  1.17727113e-02, -4.02593559e-03,
         7.14929823e-02,  3.52435672e-01,  1.32079014e-01,
         2.29188280e-01,  1.95115999e-02,  1.50769907e-01,
        -1.89542877e-03,  1.38379770e-02,  2.17200109e-02,
         5.50789133e-01,  4.57411758e-02,  1.71412903e-01,
         5.44978447e-03,  1.03454365e+00,  1.57545089e-02,
         6.94463869e-02,  2.01840435e-02,  3.95924427e-02,
         1.08291015e-02,  3.87196151e-02, -6.42074546e-03,
        -2.40633717e-03,  1.95738768e-01,  6.27653692e-02,
        -7.63332380e-03,  2.63466280e-02,  9.61535713e-03,
        -2.60876090e-02, -6.15860194e-03, -4.33178149e-03,
        -6.24827046e-03,  2.75119280e-03, -7.79635820e-04,
        -6.04760409e-03]])
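
When reading an array like the one above, it helps to recall that Shapley values are additive: together with a baseline (the average prediction) they sum to the prediction for the explained row. Since model_gbm was fitted on ylog, these values are presumably on the log scale (an assumption, not something the output states). The sketch below does not call shap. It uses a hypothetical linear model, for which the Shapley values have a closed form when features are treated as independent.

```python
import numpy as np

rng = np.random.default_rng(0)
X_bg = rng.normal(size=(1000, 3))   # hypothetical background data
w = np.array([2.0, -1.0, 0.5])      # hypothetical linear model weights
b = 3.0                             # hypothetical intercept


def f(X):
    """Linear model standing in for a fitted regressor."""
    return X @ w + b


x = np.array([1.0, 0.0, -2.0])      # observation to explain

# Closed-form Shapley values of a linear model with independent features.
phi = w * (x - X_bg.mean(axis=0))
base_value = f(X_bg).mean()
prediction = f(x[None, :])[0]

# Additivity: baseline plus contributions reproduces the prediction.
print(base_value + phi.sum(), prediction)
```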
In [12]:
mp = exp_gbm.model_parts(type='shap_wrapper', shap_explainer_type="TreeExplainer")
type(mp)
100%|===================| 996/1000 [00:28<00:00]        
Out[12]:
dalex.wrappers._shap.object.ShapWrapper
In [13]:
mp.plot()
In [14]:
mp.plot(plot_type='bar')
In [15]:
mp.result  # shap_values
Out[15]:
array([[-0.09194286,  0.01268505, -0.00433844, ..., -0.00288738,
        -0.00698915, -0.00871489],
       [-0.20522728,  0.01173569, -0.01236051, ...,  0.00182615,
        -0.00651697,  0.00860811],
       [-0.05913406, -0.00342238, -0.00426941, ..., -0.0012598 ,
        -0.00647394, -0.00500574],
       ...,
       [ 0.1861156 , -0.00882657, -0.00470069, ..., -0.0020923 ,
        -0.00232821, -0.00306167],
       [-0.11802352,  0.00800961,  0.01267808, ..., -0.00480292,
        -0.00689434, -0.00695982],
       [ 0.07008078, -0.00760733, -0.00859892, ..., -0.00235196,
        -0.0032801 , -0.0061742 ]])

model_diagnostics

The new model_diagnostics method supports residual diagnostics.

In [16]:
md_svm = exp_svm.model_diagnostics()
md_gbm = exp_gbm.model_diagnostics()
md_svm.plot(md_gbm, variable='age', yvariable='residuals', marker_size=5)
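
The plot above puts residuals against a chosen variable. As a rough sketch of the underlying idea (not of the dalex implementation), the code below computes residuals as y minus yhat, matching the explainer log, and averages them per value of a hypothetical age feature. All data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.integers(17, 40, size=500)                      # hypothetical feature
y = 1000.0 * age + rng.normal(0.0, 500.0, size=500)       # hypothetical target
y_hat = 1000.0 * age                                      # hypothetical predictions

residuals = y - y_hat   # same convention as the explainer log: y minus yhat

# Mean residual per age value; values far from zero flag systematic error.
ages = np.unique(age)
mean_resid = np.array([residuals[age == a].mean() for a in ages])

for a, m in zip(ages[:5], mean_resid[:5]):
    print(a, round(float(m), 1))
```

A model with no systematic error in age should give per-age means that scatter around zero, which is the pattern to look for in the diagnostic plot.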