Advanced tutorial on bias detection in dalex

In this tutorial, we will cover more advanced aspects of the fairness module in dalex. For an introduction, see the Fairness module in dalex example.

In [1]:
import pandas as pd 
import numpy as np 
{"pd": pd.__version__, "np": np.__version__}
Out[1]:
{'pd': '1.1.4', 'np': '1.19.3'}

Data

First, we load the data. This tutorial is based on the famous ProPublica study of the COMPAS recidivism algorithm.

In [2]:
compas = pd.read_csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv")

To get a clearer picture, we will use only a few columns of the original data frame.

In [3]:
compas = compas[["sex", "age", "age_cat", "race", "juv_fel_count", "juv_misd_count",
                 "juv_other_count", "priors_count", "c_charge_degree", "is_recid",
                 "is_violent_recid", "two_year_recid"]]
compas.head() 
Out[3]:
sex age age_cat race juv_fel_count juv_misd_count juv_other_count priors_count c_charge_degree is_recid is_violent_recid two_year_recid
0 Male 69 Greater than 45 Other 0 0 0 0 F 0 0 0
1 Male 34 25 - 45 African-American 0 0 0 0 F 1 1 1
2 Male 24 Less than 25 African-American 0 0 1 4 F 1 0 1
3 Male 23 Less than 25 African-American 0 1 0 1 F 0 0 0
4 Male 43 25 - 45 Other 0 0 0 2 F 0 0 0

As we can see, we have a relatively compact pandas.DataFrame. The target variable is two_year_recid, which denotes whether a particular person re-offended within two years. For this tutorial, we will use scikit-learn models, but these methods are model-agnostic.
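Before modelling, it can be useful to check how balanced the target is. A minimal sketch with plain pandas (not part of the original notebook) could look like this:

# share of recidivists vs. non-recidivists in the target column
compas.two_year_recid.value_counts(normalize=True)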

In [4]:
# keep age_cat aside for later use and drop it from the modelling data
age_cat = compas.age_cat
compas = compas.drop("age_cat", axis=1)
In [5]:
compas.dtypes
Out[5]:
sex                 object
age                  int64
race                object
juv_fel_count        int64
juv_misd_count       int64
juv_other_count      int64
priors_count         int64
c_charge_degree     object
is_recid             int64
is_violent_recid     int64
two_year_recid       int64
dtype: object
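
Because bias detection later in this tutorial revolves around protected attributes such as race and sex, it may also help to glance at their levels first. A quick sketch (again, not part of the original notebook):

# how many observations fall into each race / sex category
print(compas.race.value_counts())
print(compas.sex.value_counts())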

Models

As in the previous example, we will build 3 basic predictive models without hyperparameter tuning.

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(compas.drop("two_year_recid", axis=1),
                                                    compas.two_year_recid,
                                                    test_size=0.3,
                                                    random_state=123)

categorical_features = ['sex', 'race', 'c_charge_degree']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numerical_features = ["age", "priors_count"]
numerical_transformer = Pipeline(steps=[
    ('scale', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numerical_transformer, numerical_features)
])

clf_tree = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(max_depth=7, random_state=123))
])

clf_forest = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=200, max_depth=7, random_state=123))
])

clf_logreg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])


clf_logreg.fit(X_train, y_train)
clf_forest.fit(X_train, y_train)
clf_tree.fit(X_train, y_train)
Out[6]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['sex', 'race',
                                                   'c_charge_degree']),
                                                 ('num',
                                                  Pipeline(steps=[('scale',
                                                                   StandardScaler())]),
                                                  ['age', 'priors_count'])])),
                ('classifier',
                 DecisionTreeClassifier(max_depth=7, random_state=123))])
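
As a quick sanity check (not shown in the original notebook), one can ask each fitted pipeline for a few probability predictions; the probability of the positive class sits in the second column of predict_proba:

# probability of recidivism for the first 5 test rows, one model per column
pd.DataFrame({
    "logreg": clf_logreg.predict_proba(X_test)[:5, 1],
    "tree":   clf_tree.predict_proba(X_test)[:5, 1],
    "forest": clf_forest.predict_proba(X_test)[:5, 1],
})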

Explainers

Now, we need to create Explainer objects with the help of dalex.

In [7]:
import dalex as dx
dx.__version__
Out[7]:
'1.0.0'
In [8]:
exp_logreg = dx.Explainer(clf_logreg, X_test, y_test)
Preparation of a new explainer is initiated

  -> data              : 2165 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2165 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x000001FEFFC364C0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0421, mean = 0.444, max = 0.988
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.978, mean = 2.17e-05, max = 0.92
  -> model_info        : package sklearn

A new explainer has been created!
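
The verbose log above shows that the default predict function returns probabilities. If memory serves, the Explainer exposes this wrapper through its predict method, so a quick check might look like this (a sketch, not from the original notebook):

# predicted probabilities for the first 5 test observations
exp_logreg.predict(X_test.iloc[:5])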
In [9]:
exp_tree = dx.Explainer(clf_tree, X_test, y_test, verbose=False)
exp_forest = dx.Explainer(clf_forest, X_test, y_test, verbose=False)

model performance

In [10]:
pd.concat([exp.model_performance().result for exp in [exp_logreg, exp_tree, exp_forest]])
Out[10]:
recall precision f1 accuracy auc
LogisticRegression 0.545265 0.659119 0.596811 0.672979 0.706219
DecisionTreeClassifier 0.516129 0.639175 0.571100 0.655889 0.688296
RandomForestClassifier 0.546306 0.649752 0.593556 0.667898 0.712737
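
All three models are compared at the default probability cutoff of 0.5. As far as I recall, model_performance accepts a cutoff argument, so a hedged sketch of recomputing the metrics at a different threshold could be:

# recompute classification metrics at a lower decision threshold
# (the cutoff parameter is assumed here, check the dalex docs for your version)
exp_forest.model_performance(cutoff=0.4).result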

permutation-based variable importance

Now, we will use one of the dalex methods to assess the permutation-based variable importance of these models.

In [11]:
exp_tree.model_parts().plot(objects=[exp_forest.model_parts(), exp_logreg.model_parts()])
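
The plot call above compares the three explainers directly. If one prefers raw numbers, model_parts returns an object whose result attribute is a DataFrame; a small sketch (the loss_function argument shown here is an assumption about the API, '1-auc' being the usual classification loss in dalex):

# permutation-based variable importance as a DataFrame
vi_forest = exp_forest.model_parts(loss_function='1-auc')
vi_forest.result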