Arena For Python

Load Data

In [1]:
import dalex as dx

import warnings
warnings.filterwarnings('ignore')

dx.__version__
Out[1]:
'1.0.0'
In [2]:
train = dx.datasets.load_apartments()
test = dx.datasets.load_apartments_test()

X_train = train.drop(columns='m2_price')
y_train = train["m2_price"]

X_test = test.drop(columns='m2_price')
y_test = test["m2_price"]

Preprocessing

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
numerical_features = X_train.select_dtypes(exclude=[object]).columns
numerical_transformer = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

categorical_features = X_train.select_dtypes(include=[object]).columns
categorical_transformer = Pipeline(
    steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
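As a quick standalone sanity check (not part of the original notebook; the toy values below are made up), the same ColumnTransformer pattern on a tiny frame shows what the preprocessor emits — one scaled column per numeric feature plus one column per category level:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frame mimicking the apartments features (values are illustrative)
df = pd.DataFrame({
    "surface": [40.0, 60.0, 80.0],               # numeric
    "district": ["Mokotow", "Wola", "Mokotow"],  # categorical
})

numerical = df.select_dtypes(exclude=[object]).columns
categorical = df.select_dtypes(include=[object]).columns

pre = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[("scaler", StandardScaler())]), numerical),
        ("cat", Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ]
)

out = pre.fit_transform(df)
# 1 scaled numeric column + 2 one-hot columns for the 2 districts
print(out.shape)  # (3, 3)
```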

Fit models

In [4]:
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
model_elastic_net = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', ElasticNet())
    ]
)
model_elastic_net.fit(X=X_train, y=y_train)
model_decision_tree = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', DecisionTreeRegressor())
    ]
)
model_decision_tree.fit(X=X_train, y=y_train)
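Before comparing the two models in Arena, it can be useful to confirm that both pipelines actually learn something on held-out data. A standalone sketch on synthetic data (not the apartments dataset) using the same two estimators:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic regression task standing in for the apartments data
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for model in (ElasticNet(), DecisionTreeRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    scores[type(model).__name__] = model.score(X_te, y_te)  # R^2 on held-out data

print(scores)
```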
Out[4]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  Index(['construction_year', 'surface', 'floor', 'no_rooms'], dtype='object')),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['district'], dtype='object'))])),
                ('model', DecisionTreeRegressor())])

Create dalex Explainer for each model

In [5]:
exp_elastic_net = dx.Explainer(model_elastic_net, data=X_test, y=y_test)
exp_decision_tree = dx.Explainer(model_decision_tree, data=X_test, y=y_test)
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : sklearn.linear_model._coordinate_descent.ElasticNet (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x0000020F1A5043A0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 2.46e+03, mean = 3.5e+03, max = 4.66e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -9.47e+02, mean = 11.4, max = 2.16e+03
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 9000 rows 5 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 9000 values
  -> model_class       : sklearn.tree._classes.DecisionTreeRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x0000020F1A5043A0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 1.61e+03, mean = 3.51e+03, max = 6.6e+03
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.19e+03, mean = 3.23, max = 1e+03
  -> model_info        : package sklearn

A new explainer has been created!

Arena features

Live mode using all available observations

In [6]:
# create empty Arena
arena = dx.Arena()
# push created explainer
arena.push_model(exp_elastic_net)
# push whole test dataset (including target column)
arena.push_observations(test)
# run server on port 9294
arena.run_server(port=9294)
https://arena.drwhy.ai/?data=http://127.0.0.1:9294/

The server auto-updates, so you can add a second model while it is running.

In [7]:
arena.push_model(exp_decision_tree)

You can stop the server using this method:

In [8]:
arena.stop_server()

Static mode using a subset of observations

You create the Arena exactly the same way.

In [9]:
# create empty Arena
arena = dx.Arena()
# push created explainers
arena.push_model(exp_elastic_net)
arena.push_model(exp_decision_tree)
# push first 3 rows of the testing dataset
arena.push_observations(test.iloc[0:3])
# save arena to file
arena.save("data.json")

You can automatically upload this data source to the GitHub Gist service. By default OAuth is used, but you can provide your Personal Access Token using the token argument.

In [ ]:
arena.upload(open_browser=False)
In [11]:
arena = dx.Arena()
arena.push_model(exp_decision_tree)
arena.push_observations(test)
arena.run_server(port=9294)

arena.print_options()
https://arena.drwhy.ai/?data=http://127.0.0.1:9294/

SHAPValues
---------------------------------
B: 10   #Number of random paths

FeatureImportance
---------------------------------
N: None   #Number of observations to use. None for all.
B: 10   #Number of permutation rounds to perform each variable

PartialDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 101   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

AccumulatedDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 101   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

CeterisParibus
---------------------------------
grid_points: 101   #Maximum number of points for profile
grid_type: quantile   #grid type "quantile" or "uniform"

Breakdown
---------------------------------

Metrics
---------------------------------

ROC
---------------------------------
grid_points: 101   #Maximum number of points for ROC curve

You can easily change chart options, and the dashboard will be automatically refreshed.

In [12]:
# Chart-specific
arena.set_option('CeterisParibus', 'grid_type', 'uniform')
# For all charts
arena.set_option(None, 'grid_points', 200)
In [13]:
arena.print_options()
SHAPValues
---------------------------------
B: 10   #Number of random paths

FeatureImportance
---------------------------------
N: None   #Number of observations to use. None for all.
B: 10   #Number of permutation rounds to perform each variable

PartialDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 200   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

AccumulatedDependence
---------------------------------
grid_type: quantile   #grid type "quantile" or "uniform"
grid_points: 200   #Maximum number of points for profile
N: 500   #Number of observations to use. None for all.

CeterisParibus
---------------------------------
grid_points: 200   #Maximum number of points for profile
grid_type: uniform   #grid type "quantile" or "uniform"

Breakdown
---------------------------------

Metrics
---------------------------------

ROC
---------------------------------
grid_points: 200   #Maximum number of points for ROC curve

Cache [Advanced]

The cache contains already generated charts. In live mode, it holds the charts that users have opened. In static mode, it contains all charts if precalculate=True or after the save method has been called.

In [14]:
# default behaviour: precalculate=False
arena = dx.Arena()
arena.push_model(exp_elastic_net)
print(len(arena.cache))
arena.save('data.json')
print(len(arena.cache))
0
12
In [15]:
# with precalculate=True
arena = dx.Arena(precalculate=True)
arena.push_model(exp_elastic_net)
print(len(arena.cache))
arena.push_model(exp_decision_tree)
print(len(arena.cache))
12
24

Filling and clearing cache

In [16]:
print(len(arena.cache))

arena.clear_cache()
print(len(arena.cache))

arena.fill_cache()
print(len(arena.cache))
24
0
24

Options

Changing options removes the affected charts from the cache. If precalculate is True, the charts are generated again immediately.

In [17]:
# precalculate is enabled
print(len(arena.cache))
arena.set_option('FeatureImportance', 'B', 5)
print(len(arena.cache))
24
24
In [18]:
arena.precalculate = False
print(len(arena.cache))
arena.set_option('FeatureImportance', 'B', 5)
print(len(arena.cache))
24
22