Arena for Python

Load data

In [1]:
import dalex as dx

import warnings
warnings.filterwarnings('ignore')

import plotly
plotly.offline.init_notebook_mode()

dx.__version__
Out[1]:
'1.7.0'
In [2]:
train = dx.datasets.load_apartments()
test = dx.datasets.load_apartments_test()

X_train = train.drop(columns='m2_price')
y_train = train['m2_price']

X_test = test.drop(columns='m2_price')
y_test = test['m2_price']
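As a quick sanity check, the shapes can be compared with the sizes reported by the explainer and LightGBM logs below (1000 training rows, 9000 test rows, five features plus the m2_price target):

print(train.shape, test.shape)   # expected: (1000, 6) (9000, 6)
print(X_train.columns.tolist())  # four numeric features plus 'district'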

Preprocessing

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
numerical_features = X_train.select_dtypes(exclude=[object]).columns
numerical_transformer = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

categorical_features = X_train.select_dtypes(include=[object]).columns
categorical_transformer = Pipeline(
    steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
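Before fitting the full pipelines, it may be worth confirming that the preprocessor yields the expected design matrix. The shape below is inferred from the LightGBM log further down (1000 rows; 4 scaled numeric columns plus 10 one-hot district levels = 14 features):

# Fit the preprocessor alone to inspect the transformed feature space.
Xt = preprocessor.fit_transform(X_train)
print(Xt.shape)  # expected: (1000, 14)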

Fit models

In [4]:
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor

model_elastic_net = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', ElasticNet(alpha=0.2))
    ]
)
model_elastic_net.fit(X=X_train, y=y_train)

model_decision_tree = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', DecisionTreeRegressor())
    ]
)
model_decision_tree.fit(X=X_train, y=y_train)
Out[4]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  Index(['construction_year', 'surface', 'floor', 'no_rooms'], dtype='object')),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['district'], dtype='object'))])),
                ('model', DecisionTreeRegressor())])

Create a dalex Explainer for each model

In [5]:
exp_elastic_net = dx.Explainer(model_elastic_net, data=X_test, y=y_test)
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : sklearn.linear_model._coordinate_descent.ElasticNet (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x11fc8c360> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 2.08e+03, mean = 3.5e+03, max = 5.4e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -5.9e+02, mean = 6.8, max = 1.39e+03
  -> model_info        : package sklearn

A new explainer has been created!
In [6]:
exp_decision_tree = dx.Explainer(model_decision_tree, data=X_test, y=y_test)
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : sklearn.tree._classes.DecisionTreeRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x11fc8c360> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 1.62e+03, mean = 3.51e+03, max = 6.6e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.28e+03, mean = 5.93, max = 9.1e+02
  -> model_info        : package sklearn

A new explainer has been created!
In [7]:
exp_elastic_net.model_performance()
Out[7]:
            mse            rmse        r2        mae         mad
ElasticNet  197572.651052  444.491452  0.756325  358.637412  393.033877
In [8]:
exp_decision_tree.model_performance()
Out[8]:
                       mse           rmse        r2        mae         mad
DecisionTreeRegressor  53373.665667  231.027413  0.934172  148.900778  81.0
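
The decision tree clearly outperforms the linear model on the test set (RMSE 231 vs. 444). The two performance objects can also be plotted together for a visual comparison; a minimal sketch using dalex's standard plot interface, which overlays the residual distributions in a single plotly chart:

mp_en = exp_elastic_net.model_performance()
mp_dt = exp_decision_tree.model_performance()
# Overlay both models' residuals in one interactive chart.
mp_en.plot(mp_dt)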

Arena features

Live mode using all available observations

In [9]:
# create empty Arena
arena = dx.Arena()
# push created explainer
arena.push_model(exp_elastic_net)
# push whole test dataset (including target column)
arena.push_observations(test)
# run server on port 9294
arena.run_server(port=9294)
https://arena.drwhy.ai/?data=http://127.0.0.1:9294/

The server updates automatically, so you can push a second model while it is running.

In [10]:
arena.push_model(exp_decision_tree)

And a third one!

In [11]:
from lightgbm import LGBMRegressor
model_gbm = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', LGBMRegressor())
    ]
)
model_gbm.fit(X=X_train, y=y_train)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000214 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 259
[LightGBM] [Info] Number of data points in the train set: 1000, number of used features: 14
[LightGBM] [Info] Start training from score 3487.019000
Out[11]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  Index(['construction_year', 'surface', 'floor', 'no_rooms'], dtype='object')),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['district'], dtype='object'))])),
                ('model', LGBMRegressor())])
In [12]:
exp_gbm = dx.Explainer(model_gbm, data=X_test, y=y_test)
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : lightgbm.sklearn.LGBMRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x11fc8c360> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 1.63e+03, mean = 3.5e+03, max = 6.43e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -5.02e+02, mean = 8.94, max = 7.27e+02
  -> model_info        : package sklearn

A new explainer has been created!
In [13]:
arena.push_model(exp_gbm)

Stop the server using this method:

In [14]:
arena.stop_server()

Static mode using a subset of observations

Create an Arena in exactly the same way.

In [15]:
# create empty Arena
arena = dx.Arena()
# reduce sampled rows (default 500); the full computation takes too long
arena.set_option('DatasetShapleyValues', 'N', 10) 
# push created explainers
arena.push_model(exp_gbm)
arena.push_model(exp_decision_tree)
# push the first 3 rows of the test dataset
arena.push_observations(test.iloc[0:3])
# save arena to file
arena.save("data.json")
Shapley Values: 100%|██████████| 6/6 [00:55<00:00,  9.32s/it]
Variable Importance: 100%|██████████| 2/2 [00:01<00:00,  1.79it/s]
Partial Dependence: 100%|██████████| 10/10 [00:00<00:00, 32.91it/s]
Accumulated Dependence: 100%|██████████| 10/10 [00:00<00:00, 15.13it/s]
Ceteris Paribus: 100%|██████████| 30/30 [00:00<00:00, 221.82it/s]
Break Down: 100%|██████████| 6/6 [00:00<00:00, 13.23it/s]
Metrics: 100%|██████████| 2/2 [00:00<00:00, 729.95it/s]
Fairness: 100%|██████████| 10/10 [00:00<00:00, 17.12it/s]
Shapley Values Dependence: 100%|██████████| 10/10 [00:34<00:00,  3.50s/it]
Shapley Variable Importance: 100%|██████████| 2/2 [00:00<00:00, 63.20it/s]

You can automatically upload this data source to the GitHub Gist service. By default, OAuth is used, but you can provide a Personal Access Token using the token argument.

In [16]:
arena.upload(open_browser=False)
Shapley Values: 100%|██████████| 6/6 [00:00<00:00, 9317.22it/s]
Variable Importance: 100%|██████████| 2/2 [00:00<00:00, 1366.89it/s]
Partial Dependence: 100%|██████████| 10/10 [00:00<00:00, 39053.11it/s]
Accumulated Dependence: 100%|██████████| 10/10 [00:00<00:00, 27413.75it/s]
Ceteris Paribus: 100%|██████████| 30/30 [00:00<00:00, 50091.21it/s]
Break Down: 100%|██████████| 6/6 [00:00<00:00, 11949.58it/s]
Metrics: 100%|██████████| 2/2 [00:00<00:00, 8481.91it/s]
Fairness: 100%|██████████| 10/10 [00:00<00:00, 24877.25it/s]
Shapley Values Dependence: 100%|██████████| 10/10 [00:00<00:00, 17734.90it/s]
Shapley Variable Importance: 100%|██████████| 2/2 [00:00<00:00, 8073.73it/s]
Out[16]:
'https://arena.drwhy.ai/?data=https://gist.githubusercontent.com/hbaniecki/7b647b3f22a438725fe038e20a0a4493/raw/60d35a4dddb4e1d653c3eb1bd5b9fdd5a0314d59/datasource.json'
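
If the OAuth flow is inconvenient (e.g. on a headless machine), the token argument mentioned above accepts a GitHub Personal Access Token instead; the token value here is a placeholder:

# 'ghp_...' is a placeholder - substitute a PAT with the gist scope.
arena.upload(token='ghp_your_personal_access_token', open_browser=False)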
In [19]:
arena = dx.Arena()
arena.push_model(exp_decision_tree)
arena.push_observations(test)
arena.run_server(port=9294)

arena.print_options()
https://arena.drwhy.ai/?data=http://127.0.0.1:9294/

ShapleyValues
---------------------------------
B: 20   #Number of random paths
cpus: 4   #Number of parallel processes

VariableImportance
---------------------------------
N: None   #Number of observations to use. None for all.
B: 10   #Number of permutation rounds to perform each variable

PartialDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 101   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

AccumulatedDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 101   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

CeterisParibus
---------------------------------
grid_points: 101   #Maximum number of points for profile
grid_type: quantile   #grid type "quantile" or "uniform"

ROC
---------------------------------
grid_points: 101   #Maximum number of points for ROC curve

Fairness
---------------------------------
cutoffs: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]   #List of tested cutoff levels

DatasetShapleyValues
---------------------------------
B: 4   #Number of random paths
N: 500   #Number of randomly sampled rows from dataset
cpus: 4   #Number of parallel processes

VariableAgainstAnother
---------------------------------
points_number: 150   #Maximum sample size to visualize in the variable against another scatter plot

VariableDistribution
---------------------------------
bins: [5, 10, 15, 20, 25, 30, 35, 40]   #List of available bin counts for the variable distribution plot
Address already in use
Port 9294 is in use by another program. Either identify and stop that program, or start the server with a different port.
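
The 'Address already in use' message means an earlier server instance is still bound to port 9294. Stopping it and restarting on a free port (9295 here is an arbitrary choice) resolves this, reusing the same calls shown above:

arena.stop_server()          # release the previous server if still running
arena.run_server(port=9295)  # any free port works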

You can easily change chart options, and the dashboard refreshes automatically.

In [20]:
# Chart-specific
arena.set_option('CeterisParibus', 'grid_type', 'uniform')
# For all charts
arena.set_option(None, 'grid_points', 200)
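
To read a single option back, Arena also provides get_option (assuming your dalex version exposes it alongside set_option and print_options):

# Verify that the overrides took effect.
print(arena.get_option('CeterisParibus', 'grid_type'))       # 'uniform'
print(arena.get_option('PartialDependence', 'grid_points'))  # 200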
In [22]:
arena.print_options()
ShapleyValues
---------------------------------
B: 20   #Number of random paths
cpus: 4   #Number of parallel processes

VariableImportance
---------------------------------
N: None   #Number of observations to use. None for all.
B: 10   #Number of permutation rounds to perform each variable

PartialDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 200   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

AccumulatedDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 200   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

CeterisParibus
---------------------------------
grid_points: 200   #Maximum number of points for profile
grid_type: uniform   #grid type "quantile" or "uniform"

ROC
---------------------------------
grid_points: 200   #Maximum number of points for ROC curve

Fairness
---------------------------------
cutoffs: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]   #List of tested cutoff levels

DatasetShapleyValues
---------------------------------
B: 4   #Number of random paths
N: 500   #Number of randomly sampled rows from dataset
cpus: 4   #Number of parallel processes

VariableAgainstAnother
---------------------------------
points_number: 150   #Maximum sample size to visualize in the variable against another scatter plot

VariableDistribution
---------------------------------
bins: [5, 10, 15, 20, 25, 30, 35, 40]   #List of available bin counts for the variable distribution plot

Plots

This package uses plotly to render the plots.
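
For example, permutation-based variable importance for the gradient boosting model can be computed and rendered as an interactive plotly chart (a minimal sketch):

# Compute permutation variable importance and plot it with plotly.
vi = exp_gbm.model_parts()
vi.plot()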

Resources: https://dalex.drwhy.ai/python