Arena for Python

Load Data

In [1]:
import dalex as dx

import warnings
warnings.filterwarnings('ignore')

dx.__version__
Out[1]:
'1.4.0'
In [2]:
train = dx.datasets.load_apartments()
test = dx.datasets.load_apartments_test()

X_train = train.drop(columns='m2_price')
y_train = train["m2_price"]

X_test = test.drop(columns='m2_price')
y_test = test["m2_price"]

Preprocessing

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# scale numerical features
numerical_features = X_train.select_dtypes(exclude=[object]).columns
numerical_transformer = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

# one-hot encode categorical features
categorical_features = X_train.select_dtypes(include=[object]).columns
categorical_transformer = Pipeline(
    steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)

# combine both transformers into a single preprocessing step
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

Fit models

In [4]:
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor

model_elastic_net = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', ElasticNet(alpha=0.2))
    ]
)
model_elastic_net.fit(X=X_train, y=y_train)

model_decision_tree = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', DecisionTreeRegressor())
    ]
)
model_decision_tree.fit(X=X_train, y=y_train)
Out[4]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  Index(['construction_year', 'surface', 'floor', 'no_rooms'], dtype='object')),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['district'], dtype='object'))])),
                ('model', DecisionTreeRegressor())])

Create dalex Explainer for each model

In [5]:
exp_elastic_net = dx.Explainer(model_elastic_net, data=X_test, y=y_test)
exp_decision_tree = dx.Explainer(model_decision_tree, data=X_test, y=y_test)
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : sklearn.linear_model._coordinate_descent.ElasticNet (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x000001B510407EE0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 2.08e+03, mean = 3.5e+03, max = 5.4e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -5.9e+02, mean = 6.8, max = 1.39e+03
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : sklearn.tree._classes.DecisionTreeRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x000001B510407EE0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 1.62e+03, mean = 3.51e+03, max = 6.6e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.19e+03, mean = 5.9, max = 1.09e+03
  -> model_info        : package sklearn

A new explainer has been created!
In [6]:
exp_elastic_net.model_performance()
Out[6]:
                      mse        rmse        r2         mae         mad
ElasticNet  197572.651052  444.491452  0.756325  358.637412  393.033877
In [7]:
exp_decision_tree.model_performance()
Out[7]:
                             mse        rmse        r2         mae   mad
DecisionTreeRegressor  53504.216  231.309784  0.934011  148.770222  81.0

Arena features

Live mode using all available observations

In [8]:
# create empty Arena
arena = dx.Arena()
# push created explainer
arena.push_model(exp_elastic_net)
# push whole test dataset (including target column)
arena.push_observations(test)
# run server on port 9294
arena.run_server(port=9294)
https://arena.drwhy.ai/?data=http://127.0.0.1:9294/

The server updates automatically. You can add a second model while it is running.

In [9]:
arena.push_model(exp_decision_tree)

And a third one!

In [10]:
from lightgbm import LGBMRegressor
model_gbm = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', LGBMRegressor())
    ]
)
model_gbm.fit(X=X_train, y=y_train)
exp_gbm = dx.Explainer(model_gbm, data=X_test, y=y_test)
arena.push_model(exp_gbm)
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : lightgbm.sklearn.LGBMRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x000001B510407EE0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 1.63e+03, mean = 3.5e+03, max = 6.43e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -5.02e+02, mean = 8.94, max = 7.27e+02
  -> model_info        : package sklearn

A new explainer has been created!

Stop the server using this method:

In [11]:
arena.stop_server()

Static mode using a subset of observations

Create an Arena in exactly the same way.

In [12]:
# create empty Arena
arena = dx.Arena()
# limit DatasetShapleyValues to 10 sampled rows; the default (N=500) takes too long to compute
arena.set_option('DatasetShapleyValues', 'N', 10)
# push created explainers
arena.push_model(exp_gbm)
arena.push_model(exp_decision_tree)
# push the first 3 rows of the test dataset
arena.push_observations(test.iloc[0:3])
# save arena to file
arena.save("data.json")
Shapley Values: 100%|██████████| 6/6 [02:03<00:00, 20.54s/it]
Variable Importance: 100%|██████████| 2/2 [00:03<00:00,  1.57s/it]
Partial Dependence: 100%|██████████| 10/10 [00:00<00:00, 10.41it/s]
Accumulated Dependence: 100%|██████████| 10/10 [00:02<00:00,  4.66it/s]
Ceteris Paribus: 100%|██████████| 30/30 [00:00<00:00, 50.68it/s]
Break Down: 100%|██████████| 6/6 [00:01<00:00,  4.34it/s]
Metrics: 100%|██████████| 2/2 [00:00<00:00, 333.37it/s]
Shapley Values Dependence: 100%|██████████| 10/10 [01:19<00:00,  7.92s/it]
Shapley Variable Importance: 100%|██████████| 2/2 [00:00<00:00, 22.99it/s]

You can automatically upload this data source to the GitHub Gist service. By default, OAuth is used, but you can provide a Personal Access Token using the token argument.

In [13]:
arena.upload(open_browser=False)
Shapley Values: 100%|██████████| 6/6 [00:00<?, ?it/s]
Variable Importance: 100%|██████████| 2/2 [00:00<00:00, 999.24it/s]
Partial Dependence: 100%|██████████| 10/10 [00:00<00:00, 9993.58it/s]
Accumulated Dependence: 100%|██████████| 10/10 [00:00<00:00, 10000.72it/s]
Ceteris Paribus: 100%|██████████| 30/30 [00:00<00:00, 30073.88it/s]
Break Down: 100%|██████████| 6/6 [00:00<00:00, 6013.34it/s]
Metrics: 100%|██████████| 2/2 [00:00<?, ?it/s]
Shapley Values Dependence: 100%|██████████| 10/10 [00:00<00:00, 9953.26it/s]
Shapley Variable Importance: 100%|██████████| 2/2 [00:00<00:00, 2004.45it/s]
Out[13]:
'https://arena.drwhy.ai/?data=https://gist.githubusercontent.com/hbaniecki/ceb167ab41142b29c09b5a5b399c12a8/raw/57b2cad7e82fda8abdd7edcbe8afa3ace858e3d9/datasource.json'
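
If you prefer not to rely on OAuth, you can pass a Personal Access Token explicitly via the token argument (a short sketch; the token string below is only a placeholder):

# upload the data source with a GitHub Personal Access Token instead of OAuth
arena.upload(token="<your-personal-access-token>", open_browser=False)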
In [14]:
# create a new live Arena to inspect the default chart options
arena = dx.Arena()
arena.push_model(exp_decision_tree)
arena.push_observations(test)
arena.run_server(port=9294)

arena.print_options()
https://arena.drwhy.ai/?data=http://127.0.0.1:9294/

ShapleyValues
---------------------------------
B: 20   #Number of random paths
cpus: 4   #Number of parallel processes

VariableImportance
---------------------------------
N: None   #Number of observations to use. None for all.
B: 10   #Number of permutation rounds to perform each variable

PartialDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 101   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

AccumulatedDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 101   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

CeterisParibus
---------------------------------
grid_points: 101   #Maximum number of points for profile
grid_type: quantile   #grid type "quantile" or "uniform"

ROC
---------------------------------
grid_points: 101   #Maximum number of points for ROC curve

Fairness
---------------------------------
cutoffs: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]   #List of tested cutoff levels

DatasetShapleyValues
---------------------------------
B: 4   #Number of random paths
N: 500   #Number of randomly sampled rows from dataset
cpus: 4   #Number of parallel processes

You can easily change chart options, and the dashboard will refresh automatically.

In [15]:
# Chart-specific
arena.set_option('CeterisParibus', 'grid_type', 'uniform')
# For all charts
arena.set_option(None, 'grid_points', 200)
In [16]:
arena.print_options()
ShapleyValues
---------------------------------
B: 20   #Number of random paths
cpus: 4   #Number of parallel processes

VariableImportance
---------------------------------
N: None   #Number of observations to use. None for all.
B: 10   #Number of permutation rounds to perform each variable

PartialDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 200   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

AccumulatedDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 200   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

CeterisParibus
---------------------------------
grid_points: 200   #Maximum number of points for profile
grid_type: uniform   #grid type "quantile" or "uniform"

ROC
---------------------------------
grid_points: 200   #Maximum number of points for ROC curve

Fairness
---------------------------------
cutoffs: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]   #List of tested cutoff levels

DatasetShapleyValues
---------------------------------
B: 4   #Number of random paths
N: 500   #Number of randomly sampled rows from dataset
cpus: 4   #Number of parallel processes

Plots

This package uses plotly to render the plots.
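
For example, dalex plot methods accept a show=False argument, which should return the underlying plotly Figure instead of displaying it. A minimal sketch (the output file name is arbitrary):

# return the plotly Figure object instead of showing the chart
fig = exp_gbm.model_performance().plot(show=False)
# the Figure can then be customised or saved with plotly's own API
fig.write_html("performance_plot.html")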

Resources - https://dalex.drwhy.ai/python