Arena for Python

Load data

In [1]:
import dalex as dx

import warnings
warnings.filterwarnings('ignore')

import plotly
plotly.offline.init_notebook_mode()

dx.__version__
Out[1]:
'1.7.0'
In [2]:
train = dx.datasets.load_apartments()
test = dx.datasets.load_apartments_test()

X_train = train.drop(columns='m2_price')
y_train = train['m2_price']

X_test = test.drop(columns='m2_price')
y_test = test['m2_price']
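As a quick sanity check, the shapes can be compared with the sizes reported by the explainer and LightGBM logs below (1000 training rows, 9000 test rows, five features plus the m2_price target):

print(train.shape, test.shape)   # expected: (1000, 6) (9000, 6)
print(X_train.columns.tolist())  # four numeric features plus 'district'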

Preprocessing

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
numerical_features = X_train.select_dtypes(exclude=[object]).columns
numerical_transformer = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

categorical_features = X_train.select_dtypes(include=[object]).columns
categorical_transformer = Pipeline(
    steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
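Before fitting the full pipelines, it may be worth confirming that the preprocessor yields the expected design matrix. The shape below is inferred from the LightGBM log further down (1000 rows; 4 scaled numeric columns plus 10 one-hot district levels = 14 features):

# Fit the preprocessor alone to inspect the transformed feature space.
Xt = preprocessor.fit_transform(X_train)
print(Xt.shape)  # expected: (1000, 14)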

Fit models

In [4]:
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor

model_elastic_net = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', ElasticNet(alpha=0.2))
    ]
)
model_elastic_net.fit(X=X_train, y=y_train)

model_decision_tree = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', DecisionTreeRegressor())
    ]
)
model_decision_tree.fit(X=X_train, y=y_train)
Out[4]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  Index(['construction_year', 'surface', 'floor', 'no_rooms'], dtype='object')),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['district'], dtype='object'))])),
                ('model', DecisionTreeRegressor())])

Create a dalex Explainer for each model

In [5]:
exp_elastic_net = dx.Explainer(model_elastic_net, data=X_test, y=y_test)
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : sklearn.linear_model._coordinate_descent.ElasticNet (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x11fc8c360> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 2.08e+03, mean = 3.5e+03, max = 5.4e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -5.9e+02, mean = 6.8, max = 1.39e+03
  -> model_info        : package sklearn

A new explainer has been created!
In [6]:
exp_decision_tree = dx.Explainer(model_decision_tree, data=X_test, y=y_test)
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : sklearn.tree._classes.DecisionTreeRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x11fc8c360> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 1.62e+03, mean = 3.51e+03, max = 6.6e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.28e+03, mean = 5.93, max = 9.1e+02
  -> model_info        : package sklearn

A new explainer has been created!
In [7]:
exp_elastic_net.model_performance()
Out[7]:
            mse            rmse        r2        mae         mad
ElasticNet  197572.651052  444.491452  0.756325  358.637412  393.033877
In [8]:
exp_decision_tree.model_performance()
Out[8]:
                       mse           rmse        r2        mae         mad
DecisionTreeRegressor  53373.665667  231.027413  0.934172  148.900778  81.0
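
The decision tree clearly outperforms the linear model on the test set (RMSE 231 vs. 444). The two performance objects can also be plotted together for a visual comparison; a minimal sketch using dalex's standard plot interface, which overlays the residual distributions in a single plotly chart:

mp_en = exp_elastic_net.model_performance()
mp_dt = exp_decision_tree.model_performance()
# Overlay both models' residuals in one interactive chart.
mp_en.plot(mp_dt)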

Arena features

Live mode using all available observations

In [9]:
# create empty Arena
arena = dx.Arena()
# push created explainer
arena.push_model(exp_elastic_net)
# push whole test dataset (including target column)
arena.push_observations(test)
# run server on port 9294
arena.run_server(port=9294)
https://arena.drwhy.ai/?data=http://127.0.0.1:9294/

The server updates automatically, so you can push a second model while it is running.

In [10]:
arena.push_model(exp_decision_tree)

And a third one!

In [11]:
from lightgbm import LGBMRegressor
model_gbm = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', LGBMRegressor())
    ]
)
model_gbm.fit(X=X_train, y=y_train)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000214 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 259
[LightGBM] [Info] Number of data points in the train set: 1000, number of used features: 14
[LightGBM] [Info] Start training from score 3487.019000
Out[11]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  Index(['construction_year', 'surface', 'floor', 'no_rooms'], dtype='object')),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['district'], dtype='object'))])),
                ('model', LGBMRegressor())])
In [12]:
exp_gbm = dx.Explainer(model_gbm, data=X_test, y=y_test)
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : lightgbm.sklearn.LGBMRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x11fc8c360> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 1.63e+03, mean = 3.5e+03, max = 6.43e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -5.02e+02, mean = 8.94, max = 7.27e+02
  -> model_info        : package sklearn

A new explainer has been created!
In [13]:
arena.push_model(exp_gbm)

Stop the server using this method:

In [14]:
arena.stop_server()

Static mode using a subset of observations

Create an Arena in exactly the same way.

In [15]:
# create empty Arena
arena = dx.Arena()
# reduce sampled rows (default 500); the full computation takes too long
arena.set_option('DatasetShapleyValues', 'N', 10) 
# push created explainers
arena.push_model(exp_gbm)
arena.push_model(exp_decision_tree)
# push the first 3 rows of the test dataset
arena.push_observations(test.iloc[0:3])
# save arena to file
arena.save("data.json")
Shapley Values: 100%|██████████| 6/6 [00:55<00:00,  9.32s/it]
Variable Importance: 100%|██████████| 2/2 [00:01<00:00,  1.79it/s]
Partial Dependence: 100%|██████████| 10/10 [00:00<00:00, 32.91it/s]
Accumulated Dependence: 100%|██████████| 10/10 [00:00<00:00, 15.13it/s]
Ceteris Paribus: 100%|██████████| 30/30 [00:00<00:00, 221.82it/s]
Break Down: 100%|██████████| 6/6 [00:00<00:00, 13.23it/s]
Metrics: 100%|██████████| 2/2 [00:00<00:00, 729.95it/s]
Fairness: 100%|██████████| 10/10 [00:00<00:00, 17.12it/s]
Shapley Values Dependence: 100%|██████████| 10/10 [00:34<00:00,  3.50s/it]
Shapley Variable Importance: 100%|██████████| 2/2 [00:00<00:00, 63.20it/s]

You can automatically upload this data source to the GitHub Gist service. By default, OAuth is used, but you can provide a Personal Access Token using the token argument.

In [16]:
arena.upload(open_browser=False)
Shapley Values: 100%|██████████| 6/6 [00:00<00:00, 9317.22it/s]
Variable Importance: 100%|██████████| 2/2 [00:00<00:00, 1366.89it/s]
Partial Dependence: 100%|██████████| 10/10 [00:00<00:00, 39053.11it/s]
Accumulated Dependence: 100%|██████████| 10/10 [00:00<00:00, 27413.75it/s]
Ceteris Paribus: 100%|██████████| 30/30 [00:00<00:00, 50091.21it/s]
Break Down: 100%|██████████| 6/6 [00:00<00:00, 11949.58it/s]
Metrics: 100%|██████████| 2/2 [00:00<00:00, 8481.91it/s]
Fairness: 100%|██████████| 10/10 [00:00<00:00, 24877.25it/s]
Shapley Values Dependence: 100%|██████████| 10/10 [00:00<00:00, 17734.90it/s]
Shapley Variable Importance: 100%|██████████| 2/2 [00:00<00:00, 8073.73it/s]
Out[16]:
'https://arena.drwhy.ai/?data=https://gist.githubusercontent.com/hbaniecki/7b647b3f22a438725fe038e20a0a4493/raw/60d35a4dddb4e1d653c3eb1bd5b9fdd5a0314d59/datasource.json'
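
If the OAuth flow is inconvenient (e.g. on a headless machine), the token argument mentioned above accepts a GitHub Personal Access Token instead; the token value here is a placeholder:

# 'ghp_...' is a placeholder - substitute a PAT with the gist scope.
arena.upload(token='ghp_your_personal_access_token', open_browser=False)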
In [19]:
arena = dx.Arena()
arena.push_model(exp_decision_tree)
arena.push_observations(test)
arena.run_server(port=9294)

arena.print_options()
https://arena.drwhy.ai/?data=http://127.0.0.1:9294/

ShapleyValues
---------------------------------
B: 20   #Number of random paths
cpus: 4   #Number of parallel processes

VariableImportance
---------------------------------
N: None   #Number of observations to use. None for all.
B: 10   #Number of permutation rounds to perform each variable

PartialDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 101   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

AccumulatedDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 101   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

CeterisParibus
---------------------------------
grid_points: 101   #Maximum number of points for profile
grid_type: quantile   #grid type "quantile" or "uniform"

ROC
---------------------------------
grid_points: 101   #Maximum number of points for ROC curve

Fairness
---------------------------------
cutoffs: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]   #List of tested cutoff levels

DatasetShapleyValues
---------------------------------
B: 4   #Number of random paths
N: 500   #Number of randomly sampled rows from dataset
cpus: 4   #Number of parallel processes

VariableAgainstAnother
---------------------------------
points_number: 150   #Maximum sample size to visualize in the variable against another scatter plot

VariableDistribution
---------------------------------
bins: [5, 10, 15, 20, 25, 30, 35, 40]   #List of available bin counts for the variable distribution plot
Address already in use
Port 9294 is in use by another program. Either identify and stop that program, or start the server with a different port.
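
The 'Address already in use' message means an earlier server instance is still bound to port 9294. Stopping it and restarting on a free port (9295 here is an arbitrary choice) resolves this, reusing the same calls shown above:

arena.stop_server()          # release the previous server if still running
arena.run_server(port=9295)  # any free port works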

You can easily change chart options, and the dashboard refreshes automatically.

In [20]:
# Chart-specific
arena.set_option('CeterisParibus', 'grid_type', 'uniform')
# For all charts
arena.set_option(None, 'grid_points', 200)
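
To read a single option back, Arena also provides get_option (assuming your dalex version exposes it alongside set_option and print_options):

# Verify that the overrides took effect.
print(arena.get_option('CeterisParibus', 'grid_type'))       # 'uniform'
print(arena.get_option('PartialDependence', 'grid_points'))  # 200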
In [22]:
arena.print_options()
ShapleyValues
---------------------------------
B: 20   #Number of random paths
cpus: 4   #Number of parallel processes

VariableImportance
---------------------------------
N: None   #Number of observations to use. None for all.
B: 10   #Number of permutation rounds to perform each variable

PartialDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 200   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

AccumulatedDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 200   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

CeterisParibus
---------------------------------
grid_points: 200   #Maximum number of points for profile
grid_type: uniform   #grid type "quantile" or "uniform"

ROC
---------------------------------
grid_points: 200   #Maximum number of points for ROC curve

Fairness
---------------------------------
cutoffs: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]   #List of tested cutoff levels

DatasetShapleyValues
---------------------------------
B: 4   #Number of random paths
N: 500   #Number of randomly sampled rows from dataset
cpus: 4   #Number of parallel processes

VariableAgainstAnother
---------------------------------
points_number: 150   #Maximum sample size to visualize in the variable against another scatter plot

VariableDistribution
---------------------------------
bins: [5, 10, 15, 20, 25, 30, 35, 40]   #List of available bin counts for the variable distribution plot

Plots

This package uses plotly to render the plots.
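
For example, permutation-based variable importance for the gradient boosting model can be computed and rendered as an interactive plotly chart (a minimal sketch):

# Compute permutation variable importance and plot it with plotly.
vi = exp_gbm.model_parts()
vi.plot()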

Resources: https://dalex.drwhy.ai/python