In the real world, we come across data with dependencies. It is almost impossible to avoid dependence among predictors when building predictive models.
Unfortunately, many commonly used explainable artificial intelligence (XAI) methods ignore these dependencies and often assume that the variables are independent (as permutation-based methods do), which leads to unrealistic settings and misleading explanations.
Explaining models built on correlated data is one of the pitfalls described in "General Pitfalls of Model-Agnostic Interpretation Methods for Machine Learning Models".
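The effect of the independence assumption can be seen in a small synthetic sketch. The data below are made up for illustration (two nearly identical predictors, `x1` and `x2`), and the threshold used to call a row "unrealistic" is an arbitrary assumption. Permuting one predictor on its own keeps its marginal distribution but produces combinations of values that do not occur in the original data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly dependent synthetic predictors: x2 is x1 plus a little noise.
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)

# Permute x2 on its own, as a permutation-based method would do.
# The marginal distribution of x2 is unchanged, but its link to x1 is broken.
x2_permuted = rng.permutation(x2)

corr_before = np.corrcoef(x1, x2)[0, 1]
corr_after = np.corrcoef(x1, x2_permuted)[0, 1]

# Assumed definition of an "unrealistic" row: the two predictors differ by
# more than 1, which the data-generating process above practically never yields.
unrealistic_before = np.mean(np.abs(x1 - x2) > 1.0)
unrealistic_after = np.mean(np.abs(x1 - x2_permuted) > 1.0)

print(f"correlation before / after permutation: {corr_before:.3f} / {corr_after:.3f}")
print(f"share of unrealistic rows before / after: {unrealistic_before:.3f} / {unrealistic_after:.3f}")
```

A model evaluated on such rows is queried outside the region covered by its training data, which is why explanations computed this way can be misleading.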
We propose a way for ML engineers to explain their models while taking into account the dependencies between variables. The first part of the module consists of functionalities that enable estimating the importance and contribution of variables by grouping them into so-called aspects. The method is inspired by the Triplot paper.
import dalex as dx
import numpy as np
# read data and create model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

# credit data
data = dx.datasets.load_german()

# risk is the target
X = data.drop(columns='risk')
y = data.risk

categorical_features = ['sex', 'job', 'housing', 'saving_accounts', 'checking_account', 'purpose']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numerical_features = ['age', 'duration', 'credit_amount']
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features),
    ('num', numerical_transformer, numerical_features)
])

classifier = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', classifier)
])

clf.fit(X, y)
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('cat', Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['sex', 'job', 'housing', 'saving_accounts', 'checking_account', 'purpose']), ('num', Pipeline(steps=[('scaler', StandardScaler())]), ['age', 'duration', 'credit_amount'])])), ('classifier', RandomForestClassifier(max_depth=5, random_state=42))])
We already have the model, so it is time to explain it. We create an Explainer object.
exp = dx.Explainer(clf, X, y)
Preparation of a new explainer is initiated

  -> data              : 1000 rows 9 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 1000 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x0000026C7DB23700> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.264, mean = 0.701, max = 0.919
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.893, mean = -0.00121, max = 0.578
  -> model_info        : package sklearn

A new explainer has been created!
Aspect - finding dependencies
Now we create an Aspect object on the basis of an explainer. It enables the use of the dalex functionalities related to explanations in groups of dependent variables (aspects).
The Aspect object itself contains information about the dependencies between variables and their hierarchical clustering into aspects.
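The idea of turning pairwise dependencies into a hierarchy of aspects can be sketched with SciPy. This is only an illustration under stated assumptions, not necessarily the procedure dalex uses internally: the dependency matrix and the variable names `x1` to `x4` are hypothetical, the distance is taken as one minus the dependency, and the average linkage and the cut height of 0.5 are arbitrary choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric dependency matrix for four variables (values in [0, 1]).
names = ["x1", "x2", "x3", "x4"]
dependency = np.array([
    [1.0, 0.1, 0.1, 0.2],
    [0.1, 1.0, 0.6, 0.1],
    [0.1, 0.6, 1.0, 0.2],
    [0.2, 0.1, 0.2, 1.0],
])

# Assumed conversion: strongly dependent variables get a small distance.
distance = 1.0 - dependency
np.fill_diagonal(distance, 0.0)

# linkage expects the condensed (upper-triangle) form of the distance matrix.
Z = linkage(squareform(distance), method="average")

# Cut the tree at height 0.5: variables merged below that height share an aspect.
labels = fcluster(Z, t=0.5, criterion="distance")

aspects = {}
for name, label in zip(names, labels):
    aspects.setdefault(int(label), []).append(name)

print(sorted(aspects.values()))
```

With these made-up numbers, only `x2` and `x3` are dependent enough to be grouped together; lowering or raising the cut height yields finer or coarser aspects.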
It is possible to choose the method used to calculate the dependencies. By default, the so-called association method is used, which relies on statistical association coefficients.
The user can also use the pps method, based on the Predictive Power Score (PPS) measure, or provide their own method. It is worth noting that PPS is a more restrictive measure, i.e., it trims the less significant (often noise-related) dependencies to 0.
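To make the notion of a dependency measure more concrete, here is a minimal sketch of one well-known statistical association coefficient for a pair of categorical variables, Cramér's V. Whether the association method uses exactly this coefficient is an assumption here; the helper `cramers_v` and the toy data are hypothetical and are not part of dalex.

```python
import numpy as np


def cramers_v(a, b):
    """Cramér's V for two categorical sequences (0 = no association, 1 = perfect).

    Assumes both variables take at least two distinct values.
    """
    # Encode categories as integer codes 0..k-1.
    a_codes = np.unique(a, return_inverse=True)[1]
    b_codes = np.unique(b, return_inverse=True)[1]
    n_a, n_b = a_codes.max() + 1, b_codes.max() + 1

    # Contingency table of observed counts.
    table = np.zeros((n_a, n_b))
    np.add.at(table, (a_codes, b_codes), 1)

    # Pearson chi-squared statistic against the independence model.
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()

    return float(np.sqrt(chi2 / (n * (min(n_a, n_b) - 1))))


# Toy data: the first pair is perfectly dependent, the second is independent.
v_dependent = cramers_v(["x", "x", "y", "y"], ["p", "p", "q", "q"])
v_independent = cramers_v(["x", "x", "y", "y"], ["p", "q", "p", "q"])
print(v_dependent, v_independent)
```

Computing such a coefficient for every pair of variables gives a symmetric matrix of dependencies, which is the kind of input a hierarchical clustering of variables needs.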
Let us check what the hierarchical clustering of variables looks like with both available methods.
asp = dx.Aspect(exp)
asp_pps = dx.Aspect(exp, depend_method='pps')
asp.plot_dendrogram(title='Hierarchical clustering dendrogram (with association)')