In this short tutorial, we show how to check whether a regression model discriminates against a particular subgroup using the dalex package.
This approach is experimental and we are grateful for all the feedback. It was implemented according to Steinberg, D., et al. (2020).
This notebook aims to show how to detect bias in regression models. It won't cover the fairness concepts or the interpretation of the plots in detail. For starters, it is best to get familiar with our fairness in classification materials:
import pandas as pd
import numpy as np
{"pd": pd.__version__, "np": np.__version__}
We use the Communities and Crime data from the paper and aim to predict the ViolentCrimesPerPop variable (total number of violent crimes per 100K population).
The protected attribute is the racepctblack value (the fraction of the population identifying as Black), which is the same one picked by the paper's authors.
data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data", header=None, na_values=["?"])
from urllib.request import urlopen
# read the attribute names from the accompanying .names file
names = urlopen("http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names")
columns = [line.split(b' ')[1].decode("utf-8") for line in names if line.startswith(b'@attribute')]
data.columns = columns
# drop the columns that contain missing values and the non-predictive identifier columns
data = data.dropna(axis=1)
data = data.iloc[:, 3:]
data.head()
X = data.drop('ViolentCrimesPerPop', axis=1)
y = data.ViolentCrimesPerPop
We make two regression models: a simple and interpretable Decision Tree, and a more complex and typically more accurate Gradient Boosting model.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = GradientBoostingRegressor()
model.fit(X_train, y_train)
model2 = DecisionTreeRegressor()
model2.fit(X_train, y_train)
In the next step we make the Explainer objects using dalex.
import dalex as dx
print(dx.__version__)
exp = dx.Explainer(model, X_test, y_test, verbose=False)
exp2 = dx.Explainer(model2, X_test, y_test, verbose=False)
pd.concat([exp.model_performance().result, exp2.model_performance().result])
Having Explainers, we are able to assess the models' fairness. To make sure that the models are fair, we will check three independence criteria: independence ($R \perp A$), separation ($R \perp A \mid Y$), and sufficiency ($Y \perp A \mid R$), where $R$ is the model's prediction, $A$ is the protected attribute, and $Y$ is the target.
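Before turning to dalex, it may help to see what the independence criterion means in practice. The snippet below is only a rough, standalone illustration (it is not the method used by dalex): it compares the distribution of the Gradient Boosting predictions between the two groups with a two-sample Kolmogorov-Smirnov test, assuming scipy is available and using the same racepctblack >= 0.5 split as later in the notebook.
# Rough illustration of the independence criterion (not the dalex method):
# compare the distribution of predictions between the protected and the remaining group.
from scipy.stats import ks_2samp
preds = model.predict(X_test)
is_protected = (X_test.racepctblack >= 0.5).values
stat, p_value = ks_2samp(preds[is_protected], preds[~is_protected])
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3g}")
# A large statistic (tiny p-value) suggests the predictions depend on the group,
# i.e. the independence criterion is likely violated.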
In the approach described in Steinberg, D., et al. (2020), the authors propose a way of checking these independence criteria. The method implemented in the dalex package is called Direct Density Ratio Estimation.
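The estimator itself is out of scope here, but the general idea behind density ratio estimation can be illustrated with a common shortcut: train a probabilistic classifier to distinguish the two groups from the model's predictions and convert its probabilities into a density ratio via Bayes' rule. The sketch below is only an illustration of that idea, not a reproduction of the dalex internals; the logistic-regression estimator and the use of raw predictions as the only feature are assumptions made for brevity.
# Illustration of classifier-based density ratio estimation (not the dalex internals).
from sklearn.linear_model import LogisticRegression
preds = model.predict(X_test).reshape(-1, 1)
is_protected = (X_test.racepctblack >= 0.5).astype(int).values
clf = LogisticRegression().fit(preds, is_protected)
p = clf.predict_proba(preds)[:, 1]
# Bayes' rule: p(r | protected) / p(r | privileged)
#   = [p(protected | r) / p(privileged | r)] * [p(privileged) / p(protected)]
prior_ratio = (1 - is_protected.mean()) / is_protected.mean()
density_ratio = p / (1 - p) * prior_ratio
# A ratio close to 1 for all predictions means the prediction distribution
# carries little information about group membership.
print(density_ratio.min(), density_ratio.max())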
# communities where at least half of the population identifies as Black form the protected group
protected = np.where(X_test.racepctblack >= 0.5, 'majority_black', 'else')
privileged = 'else'
fobject = exp.model_fairness(protected, privileged)
fobject2 = exp2.model_fairness(protected, privileged)
fobject.fairness_check()
fobject2.fairness_check()
The Decision Tree model violated 3 criteria, while Gradient Boosting violated only 2. We can plot the fairness check in the same way as for classification.
fobject2.plot()