Aspect module in dalex

In the real world, we come across data with dependencies. It is almost impossible to avoid dependence among predictors when building predictive models.

Unfortunately, many commonly used explainable artificial intelligence (XAI) methods ignore these dependencies. Permutation-based methods, for example, often assume that variables are independent, which leads to unrealistic settings and misleading explanations.
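To make the "unrealistic settings" concrete, here is a minimal sketch on synthetic data (not the credit data used later). It assumes two predictors where x2 is x1 plus a little noise, permutes x2 on its own, and compares how often the two values in a row disagree strongly before and after. All names are illustrative.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)

# Two strongly dependent synthetic predictors: x2 is x1 plus small noise.
x1 = [rng.gauss(0, 1) for _ in range(2000)]
x2 = [a + rng.gauss(0, 0.1) for a in x1]

# Permute x2 on its own, as a single-variable permutation method would.
x2_perm = x2[:]
rng.shuffle(x2_perm)

corr_before = pearson(x1, x2)
corr_after = pearson(x1, x2_perm)

# Share of rows where the two predictors differ by more than 1.
far_before = sum(abs(a - b) > 1 for a, b in zip(x1, x2)) / len(x1)
far_after = sum(abs(a - b) > 1 for a, b in zip(x1, x2_perm)) / len(x1)

print(f"correlation: {corr_before:.3f} -> {corr_after:.3f}")
print(f"share of rows with |x1 - x2| > 1: {far_before:.3f} -> {far_after:.3f}")
```

After the permutation, the model is evaluated on many rows that combine values it never saw together, which is the situation the aspect-based approach tries to avoid.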

Explaining models trained on correlated data is one of the pitfalls described in "General Pitfalls of Model-Agnostic Interpretation Methods for Machine Learning Models".

We propose a way in which ML engineers can explain their models while taking into account the dependencies between variables. The first part of the module provides functionality for estimating the importance and contribution of variables by grouping them into so-called aspects. The method is inspired by the Triplot paper.
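The general idea of group-level importance can be sketched as follows. This is an illustration only, not the dalex implementation: it assumes importance is measured as the increase in mean squared error after all columns of a group are permuted together, and it uses synthetic data and a hypothetical stand-in `model` function.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of `model` on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def grouped_permutation_importance(model, rows, targets, groups, rng):
    """Increase in MSE after jointly permuting each group of columns.

    All columns of a group are copied from the same donor row, so the
    dependence inside the group is preserved while its link to the
    target and to the remaining columns is broken.
    """
    baseline = mse(model, rows, targets)
    importance = {}
    for name, columns in groups.items():
        donors = list(range(len(rows)))
        rng.shuffle(donors)
        permuted = []
        for row, donor in zip(rows, donors):
            new_row = list(row)
            for c in columns:
                new_row[c] = rows[donor][c]
            permuted.append(new_row)
        importance[name] = mse(model, permuted, targets) - baseline
    return importance

rng = random.Random(0)

# Synthetic data: columns 0 and 1 are strongly dependent, column 2 is noise.
rows = []
for _ in range(1000):
    base = rng.gauss(0, 1)
    rows.append([base, base + rng.gauss(0, 0.1), rng.gauss(0, 1)])
targets = [r[0] + r[1] + rng.gauss(0, 0.1) for r in rows]

# Hypothetical stand-in for a fitted model: it uses columns 0 and 1 only.
def model(row):
    return row[0] + row[1]

groups = {"dependent_pair": [0, 1], "noise": [2]}
importance = grouped_permutation_importance(model, rows, targets, groups, rng)
print(importance)
```

Because the dependent columns move together, the permuted rows stay internally consistent within the group, unlike a permutation of a single correlated column.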

In [1]:
import dalex as dx
import numpy as np

Case study - German credit data

To showcase the abilities of the module, we will use the German Credit Data dataset to assign a risk to each credit applicant.

In [3]:
# read data and create model

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

# credit data
data = dx.datasets.load_german()

# risk is the target
X = data.drop(columns='risk')
y = data.risk

categorical_features = ['sex', 'job', 'housing', 'saving_accounts', 'checking_account', 'purpose']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numerical_features = ['age', 'duration', 'credit_amount']
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# one-hot encode categorical columns, standardize numerical ones
preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features),
    ('num', numerical_transformer, numerical_features)
])

classifier = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', classifier)
])

# fit the full pipeline (preprocessing + random forest)
clf.fit(X, y)

With the model trained, it is time to explain it: we create an Explainer object.

In [4]:
exp = dx.Explainer(clf, X, y)
Preparation of a new explainer is initiated

  -> data              : 1000 rows 9 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1000 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x0000026C7DB23700> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.264, mean = 0.701, max = 0.919
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.893, mean = -0.00121, max = 0.578
  -> model_info        : package sklearn

A new explainer has been created!

Creating Aspect - finding dependencies

Now we create an Aspect object from the explainer. It gives access to the dalex functionality for explaining groups of dependent variables (aspects).

The Aspect object itself contains information about the dependencies between variables and their hierarchical clustering into aspects.
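How a matrix of pairwise dependencies turns into a hierarchy can be sketched in a few lines. The example below is a simplified illustration, not the dalex code: it assumes distance = 1 - dependency and average linkage (the linkage used by the library may differ), and the variable names and dependency values are hypothetical.

```python
from itertools import combinations

def agglomerate(names, dependency):
    """Average-linkage agglomerative clustering on distance = 1 - dependency.

    Returns the merges in order as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(name,) for name in names]
    merges = []

    def distance(c1, c2):
        pairs = [(u, v) for u in c1 for v in c2]
        return sum(1.0 - dependency[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters that are currently closest to each other.
        a, b = min(combinations(clusters, 2), key=lambda pair: distance(*pair))
        merges.append((a, b, distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical pairwise dependency values in [0, 1] for four variables.
dependency = {
    frozenset(("a", "b")): 0.8,
    frozenset(("a", "c")): 0.3,
    frozenset(("b", "c")): 0.2,
    frozenset(("a", "d")): 0.1,
    frozenset(("b", "d")): 0.1,
    frozenset(("c", "d")): 0.6,
}

merges = agglomerate(["a", "b", "c", "d"], dependency)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

Cutting such a hierarchy at a chosen distance yields the groups: in this toy example, a cut between 0.4 and 0.825 gives the two aspects (a, b) and (c, d).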

The method used to calculate the dependencies can be chosen. By default, the so-called association method is used, which relies on the following statistical coefficients:

  • for two numerical variables: the association is the absolute value of Spearman's rank correlation coefficient;
  • for two categorical variables: the association is the value of Cramér’s $V$ with bias correction (based on Pearson’s chi-squared statistic);
  • for one numerical and one categorical variable: the association is the value of eta-squared $\eta^2$ (based on the H-statistic from the Kruskal-Wallis test).
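The three coefficients listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the dalex implementation: the bias correction for Cramér's V follows the commonly used Bergsma formula, and eta-squared is derived from the Kruskal-Wallis H-statistic as (H - k + 1) / (n - k), which is one common estimator. The function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import groupby

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    for _, grp in groupby(order, key=lambda i: values[i]):
        idx = list(grp)
        for i in idx:
            ranks[i] = pos + (len(idx) + 1) / 2
        pos += len(idx)
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman_abs(xs, ys):
    """Numerical vs numerical: |Spearman's rho| (Pearson on ranks)."""
    return abs(pearson(average_ranks(xs), average_ranks(ys)))

def cramers_v_corrected(xs, ys):
    """Categorical vs categorical: bias-corrected Cramér's V."""
    n = len(xs)
    row, col, cell = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    chi2 = sum(
        (cell[(a, b)] - row[a] * col[b] / n) ** 2 / (row[a] * col[b] / n)
        for a in row for b in col
    )
    r, k = len(row), len(col)
    phi2 = max(0.0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
    denom = min(r - (r - 1) ** 2 / (n - 1), k - (k - 1) ** 2 / (n - 1)) - 1
    return (phi2 / denom) ** 0.5 if denom > 0 else 0.0

def eta_squared_kruskal(values, groups):
    """Numerical vs categorical: eta-squared from the Kruskal-Wallis H."""
    n = len(values)
    sizes, rank_sums = Counter(groups), Counter()
    for rank, g in zip(average_ranks(values), groups):
        rank_sums[g] += rank
    h = 12 / (n * (n + 1)) * sum(rank_sums[g] ** 2 / sizes[g] for g in sizes)
    h -= 3 * (n + 1)
    # Standard tie correction for H.
    ties = 1 - sum(t ** 3 - t for t in Counter(values).values()) / (n ** 3 - n)
    if ties > 0:
        h /= ties
    k = len(sizes)
    return max(0.0, (h - k + 1) / (n - k))

# Toy data, chosen only to exercise each function.
rho = spearman_abs([1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5])
v_perfect = cramers_v_corrected(["a"] * 5 + ["b"] * 5, ["u"] * 5 + ["v"] * 5)
v_none = cramers_v_corrected(["a", "a", "b", "b"], ["u", "v", "u", "v"])
eta2 = eta_squared_kruskal([1, 2, 3, 4, 5, 6], ["lo"] * 3 + ["hi"] * 3)
print(rho, v_perfect, v_none, eta2)
```

All three measures fall in the range [0, 1], which is what allows them to be placed in a single dependency matrix. In practice, `scipy.stats.spearmanr`, `scipy.stats.chi2_contingency`, and `scipy.stats.kruskal` provide the underlying statistics.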

The user can also use the pps method, which is based on the Predictive Power Score (PPS) measure, or provide their own method. It is worth noting that PPS is a more restrictive measure, i.e., it trims the less significant (often noise-related) dependencies to 0.

We will check how the hierarchical clustering of variables looks with both available methods.

In [5]:
asp = dx.Aspect(exp)
In [6]:
asp_pps = dx.Aspect(exp, depend_method = 'pps')
In [7]:
asp.plot_dendrogram(title='Hierarchical clustering dendrogram (with association)')