In this tutorial, we cover the more advanced aspects of the fairness module in dalex. For an introduction, see the Fairness module in dalex example.
import pandas as pd
import numpy as np
{"pd": pd.__version__, "np": np.__version__}
Firstly we load the data, which is based on the famous ProPublica study on the COMPAS recidivism algorithm.
compas = pd.read_csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv")
To get a clearer picture, we will only use a few columns of the original data frame.
compas = compas[["sex", "age", "age_cat", "race", "juv_fel_count", "juv_misd_count",
"juv_other_count", "priors_count", "c_charge_degree", "is_recid",
"is_violent_recid", "two_year_recid"]]
compas.head()
As we can see, we have a relatively compact pandas.DataFrame. The target variable is two_year_recid, which denotes whether a particular person will re-offend within the next two years. For this tutorial, we will use scikit-learn models, but these methods are model-agnostic.
age_cat = compas.age_cat
compas = compas.drop("age_cat", axis=1)
compas.dtypes
As in the previous example, we will train three basic predictive models without hyperparameter tuning.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(compas.drop("two_year_recid", axis=1),
compas.two_year_recid,
test_size=0.3,
random_state=123)
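The split above holds out 30% of the rows for evaluation. The core idea can be sketched with plain numpy (a hypothetical index-shuffling helper for illustration, not scikit-learn's implementation):

```python
import numpy as np

def simple_train_test_split(n_rows, test_size=0.3, seed=123):
    """Shuffle row indices and split them into train/test index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_test = int(n_rows * test_size)
    return idx[n_test:], idx[:n_test]  # train indices, test indices

train_idx, test_idx = simple_train_test_split(10)
```

Fixing the seed (as random_state does above) makes the split reproducible across runs.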
categorical_features = ['sex', 'race', 'c_charge_degree']
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
numerical_features = ["age", "priors_count"]
numerical_transformer = Pipeline(steps=[
('scale', StandardScaler())
])
preprocessor = ColumnTransformer(transformers=[
('cat', categorical_transformer, categorical_features),
('num', numerical_transformer, numerical_features)
])
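The ColumnTransformer one-hot encodes the categorical columns and standardizes the numerical ones. A minimal pandas/numpy sketch of what those two transformations do on toy data (for illustration only, not scikit-learn's internals):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"sex": ["Male", "Female", "Male"],
                    "age": [25.0, 35.0, 45.0]})

# One-hot encoding: each category becomes a 0/1 indicator column
onehot = pd.get_dummies(toy["sex"])

# Standardization: zero mean, unit variance per numerical column
age = toy["age"].to_numpy()
age_scaled = (age - age.mean()) / age.std()
```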
clf_tree = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', DecisionTreeClassifier(max_depth=7, random_state=123))
])
clf_forest = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=200, max_depth=7, random_state=123))
])
clf_logreg = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression())
])
clf_logreg.fit(X_train, y_train)
clf_forest.fit(X_train, y_train)
clf_tree.fit(X_train, y_train)
Now, we need to create Explainer objects with the help of dalex.
import dalex as dx
dx.__version__
exp_logreg = dx.Explainer(clf_logreg, X_test, y_test)
exp_tree = dx.Explainer(clf_tree, X_test, y_test, verbose=False)
exp_forest = dx.Explainer(clf_forest, X_test, y_test, verbose=False)
pd.concat([exp.model_performance().result for exp in [exp_logreg, exp_tree, exp_forest]])
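Since fairness checks are the focus of this tutorial, it is worth recalling what such metrics compare under the hood. A minimal pandas sketch of one of them, statistical parity (the positive prediction rate within each subgroup), on hypothetical toy data:

```python
import pandas as pd

# Hypothetical predictions and protected attribute, for illustration only
toy = pd.DataFrame({"group": ["A", "A", "A", "B", "B"],
                    "y_pred": [1, 0, 1, 1, 1]})

# Positive prediction rate per subgroup
stp = toy.groupby("group")["y_pred"].mean()

# Ratio of each subgroup's rate to a chosen privileged group's rate
ratios = stp / stp["B"]
```

Ratios far from 1 for an unprivileged subgroup signal a potential fairness problem.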
Now, we will use one of the dalex methods to assess the variable importance of these models.
exp_tree.model_parts().plot(objects=[exp_forest.model_parts(), exp_logreg.model_parts()])
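model_parts computes permutation-based variable importance: a feature's importance is the increase in loss after its values are randomly shuffled. A minimal numpy sketch of that idea (a hypothetical helper, not dalex's implementation):

```python
import numpy as np

def permutation_importance(predict, X, y, col, loss, seed=123):
    """Loss increase after shuffling one column of X."""
    rng = np.random.default_rng(seed)
    base = loss(y, predict(X))
    X_perm = X.copy()
    X_perm[:, col] = rng.permutation(X_perm[:, col])
    return loss(y, predict(X_perm)) - base

# Toy model that only uses column 0, so column 1 should be unimportant
mse = lambda y, p: np.mean((y - p) ** 2)
predict = lambda X: X[:, 0]
X = np.column_stack([np.arange(10.0), np.ones(10)])
y = np.arange(10.0)

imp0 = permutation_importance(predict, X, y, 0, mse)
imp1 = permutation_importance(predict, X, y, 1, mse)
```

Shuffling a feature the model relies on degrades the loss, while shuffling an ignored feature changes nothing, which is why the plot above ranks features by drop-out loss.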