In the real world, we come across data with dependencies. It is almost impossible to avoid dependence among predictors when building predictive models.

Unfortunately, many commonly used explainable artificial intelligence (XAI) methods ignore these dependencies, often assuming independence of variables (permutation methods), which leads to unrealistic settings and misleading explanations.

Problems with explaining models built on correlated data are among the pitfalls described in General Pitfalls of Model-Agnostic Interpretation Methods for Machine Learning Models.

We propose a way for ML engineers to explain their models while taking the dependencies between variables into account. The first part of the module provides functionalities for estimating the importance and contribution of variables by grouping them into so-called **aspects**. The method is inspired by the Triplot paper.

In [1]:

```
import dalex as dx
import numpy as np
```

In [2]:

```
dx.__version__
```

To showcase the abilities of the module, we will use the German Credit Data dataset to assign a risk level to each credit seeker.

In [3]:

```
# read data and create model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

# credit data
data = dx.datasets.load_german()

# risk is the target
X = data.drop(columns='risk')
y = data.risk

# one-hot encode the categorical variables
categorical_features = ['sex', 'job', 'housing', 'saving_accounts', 'checking_account', 'purpose']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# standardize the numerical variables
numerical_features = ['age', 'duration', 'credit_amount']
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features),
    ('num', numerical_transformer, numerical_features)
])

classifier = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# preprocessing and classifier combined in a single pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', classifier)
])
clf.fit(X, y)
```

The model is ready, so it is time to explain it. We create an `Explainer` object.

In [4]:

```
exp = dx.Explainer(clf, X, y)
```

`Aspect` - finding dependencies

Now we create an `Aspect` object on the basis of an explainer. It enables the use of the `dalex` functionalities related to explanations in groups of dependent variables (aspects).

The `Aspect` object itself contains information about the dependencies between variables and their hierarchical clustering into aspects.
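To make the idea of clustering variables into aspects concrete, here is a minimal sketch of agglomerative clustering on a dependency matrix. It is not the `dalex` implementation: the function `complete_linkage` is hypothetical, the association values are made up for illustration, and complete linkage on the dissimilarity `1 - association` is an assumption chosen for simplicity.

```python
def complete_linkage(names, assoc):
    """Cluster variables bottom-up; return merge steps as (cluster_a, cluster_b, distance)."""
    index = {name: i for i, name in enumerate(names)}
    clusters = [(name,) for name in names]

    def dist(c1, c2):
        # complete linkage: the largest pairwise dissimilarity (1 - association)
        return max(1.0 - assoc[index[a]][index[b]] for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        # find the two closest clusters
        d, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# made-up association values, for illustration only
names = ['age', 'duration', 'credit_amount']
assoc = [
    [1.0, 0.1, 0.2],
    [0.1, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
merges = complete_linkage(names, assoc)
for left, right, distance in merges:
    print(left, '+', right, 'at distance', round(distance, 2))
```

With these values, the two most strongly associated variables are merged first, and the remaining variable joins them only at a much larger distance. Cutting such a hierarchy at a chosen height yields groups of dependent variables, which is the role aspects play here.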

It is possible to choose the method of calculating the dependencies. By default, the so-called `association` method is used, which relies on statistical coefficients:

- for two numerical variables: the association is the absolute value of Spearman's rank correlation coefficient;
- for two categorical variables: the association is the value of Cramér's $V$ with bias correction (based on Pearson's chi-squared statistic);
- for one numerical and one categorical variable: the association is the value of eta-squared $\eta^2$ (based on the H-statistic from the Kruskal-Wallis test).
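The three coefficients listed above can be sketched in plain Python as follows. This is an illustrative sketch, not the code used by `dalex`: all function names are hypothetical, the bias correction for Cramér's $V$ and the formula $\eta^2 = (H - k + 1) / (n - k)$ are the commonly used textbook forms (assumed here), and no continuity correction is applied to the chi-squared statistic.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_abs(x, y):
    """|Spearman's rho|: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return abs(cov / sqrt(var)) if var > 0 else 0.0


def cramers_v_corrected(a, b):
    """Cramér's V with bias correction, from Pearson's chi-squared statistic."""
    n = len(a)
    joint, row, col = Counter(zip(a, b)), Counter(a), Counter(b)
    chi2 = sum(
        (joint.get((i, j), 0) - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
        for i in row for j in col
    )
    r, k = len(row), len(col)
    phi2 = max(0.0, chi2 / n - (r - 1) * (k - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return sqrt(phi2 / denom) if denom > 0 else 0.0


def eta_squared_kw(values, groups):
    """Eta-squared derived from the Kruskal-Wallis H-statistic."""
    n, sizes = len(values), Counter(groups)
    k = len(sizes)
    rank_sums = Counter()
    for rank, group in zip(average_ranks(values), groups):
        rank_sums[group] += rank
    h = 12 / (n * (n + 1)) * sum(rank_sums[g] ** 2 / sizes[g] for g in sizes) - 3 * (n + 1)
    # standard correction of H for tied values
    tie_corr = 1 - sum(t ** 3 - t for t in Counter(values).values()) / (n ** 3 - n)
    if tie_corr <= 0 or n <= k:
        return 0.0
    return max(0.0, (h / tie_corr - k + 1) / (n - k))


# toy data, made up for illustration
print(spearman_abs([1, 2, 3, 4, 5], [10, 8, 6, 4, 1]))
print(cramers_v_corrected(['a'] * 10 + ['b'] * 10, ['x'] * 10 + ['y'] * 10))
print(eta_squared_kw([1, 2, 3, 4, 5, 6], ['a', 'a', 'a', 'b', 'b', 'b']))
```

All three measures fall in the range from 0 to 1, which is what allows them to be combined into a single dependency matrix over variables of mixed types.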

The user can also use the `pps` method (Predictive Power Score) or provide their own method. It is worth noting that PPS is a more restrictive measure, i.e., it trims the less significant (often noise-related) dependencies to 0.
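The trimming to 0 follows from how a PPS-style score is normalized against a naive baseline. The sketch below shows the idea for a numerical target only, assuming the usual form `1 - MAE_model / MAE_naive` with a median predictor as the baseline. The function `pps_like_score` is hypothetical, and the predictions are passed in directly, whereas an actual PPS computation would obtain them from a model fitted on a single predictor.

```python
from statistics import median


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pps_like_score(y_true, y_pred):
    """Normalized score: 0 means no better than the naive median baseline."""
    baseline = median(y_true)
    mae_naive = mean_absolute_error(y_true, [baseline] * len(y_true))
    mae_model = mean_absolute_error(y_true, y_pred)
    if mae_naive == 0:
        return 0.0
    # predictions worse than the baseline are clipped to 0 rather than going negative
    return max(0.0, 1.0 - mae_model / mae_naive)


# toy values, made up for illustration
y = [1, 2, 3, 4, 5]
informative = [1.1, 2.1, 2.9, 4.2, 4.8]  # predictions from a useful predictor
uninformative = [5, 4, 3, 2, 1]          # predictions worse than the median baseline

print(round(pps_like_score(y, informative), 3))
print(pps_like_score(y, uninformative))
```

A predictor that does not beat the baseline receives exactly 0, so weak dependencies vanish from the dependency matrix instead of appearing as small positive values.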

We will check what the variable hierarchical clustering looks like with both available methods.

In [5]:

```
asp = dx.Aspect(exp)
```

In [6]:

```
asp_pps = dx.Aspect(exp, depend_method='pps')
```

In [7]:

```
asp.plot_dendrogram(title='Hierarchical clustering dendrogram (with association)')
```