In the real world, we rarely come across data without dependencies; it is almost impossible to avoid dependence among predictors when building predictive models.
Unfortunately, many commonly used explainable artificial intelligence (XAI) methods ignore these dependencies and assume the variables are independent (as permutation-based methods do), which leads to unrealistic settings and misleading explanations.
Explaining models based on correlated data is, in fact, one of the pitfalls described in General Pitfalls of Model-Agnostic Interpretation Methods for Machine Learning Models.
We propose a way for ML engineers to explain their models while taking the dependencies between variables into account. The first part of the module provides functionalities for estimating the importance and contribution of variables by grouping them into so-called aspects. The method is inspired by the Triplot paper.
import dalex as dx
import numpy as np
dx.__version__
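To see why the independence assumption is harmful, consider a toy illustration (a hypothetical sketch, separate from the dalex workflow below): permuting one of two strongly dependent variables destroys their joint structure, so the model ends up being evaluated on combinations of values that never occur in real data.
rng = np.random.default_rng(42)
# two strongly dependent predictors (think: duration and credit_amount)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.1, size=1000)
print(np.corrcoef(x1, x2)[0, 1])       # ~0.99: strong dependence
# a permutation-based method shuffles one column independently of the rest
x1_perm = rng.permutation(x1)
print(np.corrcoef(x1_perm, x2)[0, 1])  # ~0: the dependence is destroyed,
                                       # producing unrealistic (x1, x2) pairs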
To showcase the abilities of the module, we will use the German Credit Data dataset to assign a risk level to each credit-seeker.
# read data and create model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
# credit data
data = dx.datasets.load_german()
# risk is the target
X = data.drop(columns='risk')
y = data.risk
categorical_features = ['sex', 'job', 'housing', 'saving_accounts', 'checking_account', 'purpose']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numerical_features = ['age', 'duration', 'credit_amount']
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features),
    ('num', numerical_transformer, numerical_features)
])

classifier = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', classifier)
])
clf.fit(X, y)
Now that we have the model, it is time to explain it: we create an Explainer object.
exp = dx.Explainer(clf, X, y)
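As a quick sanity check, we can inspect the model's performance through the explainer using the standard model_performance() method:
# overview of fit quality before we start explaining
exp.model_performance(model_type='classification')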
Aspect - finding dependencies

Now we create an Aspect object based on the explainer. It enables the use of the dalex functionalities related to explanations in groups of dependent variables (aspects). The Aspect object itself contains information about the dependencies between variables and their hierarchical clustering into aspects.
It is possible to choose the method of calculating the dependencies. By default, the so-called association method is used, which is based on statistical association coefficients.
The user can also choose the pps method, based on the Predictive Power Score (PPS) measure, or provide their own function. It is worth noting that PPS is the more restrictive measure, i.e., it trims less significant (often noise-related) dependencies down to 0.
We will check what the hierarchical clustering of variables looks like with both available methods.
asp = dx.Aspect(exp)
asp_pps = dx.Aspect(exp, depend_method='pps')
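Before plotting, we can peek at the computed pairwise dependencies themselves; they are stored in the Aspect object's depend_matrix attribute (a pandas DataFrame; the attribute name is assumed from recent dalex versions and may vary):
# dependency matrices for both methods; PPS typically zeroes out
# the weaker, noise-related entries
print(asp.depend_matrix)
print(asp_pps.depend_matrix)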
asp.plot_dendrogram(title='Hierarchical clustering dendrogram (with association)')
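For comparison, the same clustering computed from the PPS dependencies:
asp_pps.plot_dendrogram(title='Hierarchical clustering dendrogram (with PPS)')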