cinnamon.drift.model_drift_explainer.ModelDriftExplainer

class cinnamon.drift.model_drift_explainer.ModelDriftExplainer(model, iteration_range: Optional[Tuple[int, int]] = None, task: Optional[str] = None)

Tool to study data drift between two datasets, in a context where “model” is used to make predictions.

Parameters

model : an XGBoost model (XGBClassifier, XGBRegressor, XGBRanker, or Booster)

The model used to make predictions.

iteration_range : Tuple[int, int], optional (default=None)

Specifies which boosting rounds (layers of trees) are used. For example, if XGBoost is trained with 100 rounds, specifying iteration_range=(10, 20) uses only the trees built during iterations [10, 20) (half-open interval). If None, all trees are used.

task : string, optional (default=None)

Task corresponding to the (X, Y) data. Either “regression”, “classification”, or “ranking”. “task” must be provided if the model is treated as a black box predictor (no specific parser for the model).
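The half-open convention of iteration_range can be illustrated with a small, library-independent sketch. The helper select_rounds below is hypothetical (it is not part of cinnamon or XGBoost); it only shows which round indices a given range would cover, assuming rounds are numbered from 0.

```python
from typing import List, Optional, Tuple


def select_rounds(n_rounds: int,
                  iteration_range: Optional[Tuple[int, int]] = None) -> List[int]:
    """Return the indices of the boosting rounds covered by iteration_range.

    Hypothetical helper for illustration only; not a cinnamon or XGBoost API.
    """
    if iteration_range is None:
        # None means that every round is used.
        return list(range(n_rounds))
    start, stop = iteration_range
    # Half-open interval: start is included, stop is excluded.
    return list(range(start, min(stop, n_rounds)))


kept = select_rounds(100, (10, 20))
print(len(kept), kept[0], kept[-1])  # prints: 10 10 19
```

With 100 rounds and iteration_range=(10, 20), ten rounds are kept: round 10 is the first and round 19 is the last.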

Attributes

predictions1 : numpy array

Array of predictions of “model” on X1 (for classification, corresponds to raw predictions).

predictions2 : numpy array

Array of predictions of “model” on X2 (for classification, corresponds to raw predictions).

pred_proba1 : numpy array

Array of predicted probabilities of “model” on X1 (equal to None if regression or ranking).

pred_proba2 : numpy array

Array of predicted probabilities of “model” on X2 (equal to None if regression or ranking).

iteration_range : tuple of integers

Layer of trees used.

feature_drifts : list of dict

Drift measures for each input feature in X.

target_drift : dict

Drift measures for the labels y.

n_features : int

Number of features in input X.

feature_names : list of string

Feature names for input X.

class_names : list of string

Class names of the target when task is “classification”. Otherwise equal to None.

cat_feature_indices : list of int

Indexes of categorical features in input X (not implemented yet: only numerical features are allowed currently).

X1, X2 : pandas dataframes

X1 and X2 inputs passed to the “fit” method.

y1, y2 : numpy arrays

y1 and y2 targets passed to the “fit” method.

sample_weights1, sample_weights2 : numpy arrays

sample_weights1 and sample_weights2 arrays passed to the “fit” method.

__init__(model, iteration_range: Optional[Tuple[int, int]] = None, task: Optional[str] = None)
fit(X1: DataFrame, X2: DataFrame, y1: Optional[array] = None, y2: Optional[array] = None, sample_weights1: Optional[array] = None, sample_weights2: Optional[array] = None, cat_feature_indices: Optional[List[int]] = None)

Fit the model drift explainer to dataset 1 and dataset 2.

Parameters

X1 : pandas dataframe of shape (n_samples, n_features)

Dataset 1 inputs.

X2 : pandas dataframe of shape (n_samples, n_features)

Dataset 2 inputs.

y1 : numpy array of shape (n_samples,), optional (default=None)

Dataset 1 labels. If None, data drift is analyzed based only on the inputs X1 and X2.

y2 : numpy array of shape (n_samples,), optional (default=None)

Dataset 2 labels. If None, data drift is analyzed based only on the inputs X1 and X2.

sample_weights1 : numpy array of shape (n_samples,), optional (default=None)

Array of weights assigned to the individual samples of dataset 1. If None, each sample of dataset 1 is given unit weight.

sample_weights2 : numpy array of shape (n_samples,), optional (default=None)

Array of weights assigned to the individual samples of dataset 2. If None, each sample of dataset 2 is given unit weight.

cat_feature_indices: TODO

Returns

ModelDriftExplainer

The fitted model drift explainer.
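The effect of the optional sample weights can be pictured with a minimal, library-independent sketch. The helpers resolve_weights and weighted_mean are hypothetical names introduced for illustration; they only show the documented fallback to unit weights and are not how cinnamon is implemented.

```python
from typing import List, Optional, Sequence


def resolve_weights(n_samples: int,
                    sample_weights: Optional[Sequence[float]] = None) -> List[float]:
    """Hypothetical helper: fall back to unit weights when none are given."""
    if sample_weights is None:
        return [1.0] * n_samples
    if len(sample_weights) != n_samples:
        raise ValueError("sample_weights must have one entry per sample")
    return [float(w) for w in sample_weights]


def weighted_mean(values: Sequence[float],
                  sample_weights: Optional[Sequence[float]] = None) -> float:
    """Weighted mean of values; equals the plain mean under unit weights."""
    weights = resolve_weights(len(values), sample_weights)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


feature = [1.0, 2.0, 3.0, 6.0]
unweighted = weighted_mean(feature)                # unit weights: 3.0
reweighted = weighted_mean(feature, [1, 1, 1, 3])  # last sample counts three times: 4.0
```

Any statistic compared between the two datasets shifts in the same way when the weights are not uniform, which is why both datasets accept their own weight array.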

get_performance_metrics_drift() → PerformanceMetricsDrift

Compute performance metrics on dataset 1 and dataset 2.

Returns

Dictionary of performance metrics
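As a rough picture of what comparing performance on dataset 1 and dataset 2 means, the sketch below evaluates one metric (mean squared error) on each dataset and reports the gap. The function names and the dictionary keys are assumptions made for this example; the actual fields of PerformanceMetricsDrift are not described here.

```python
from typing import Dict, Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Plain (unweighted) mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mse_drift(y1: Sequence[float], pred1: Sequence[float],
              y2: Sequence[float], pred2: Sequence[float]) -> Dict[str, float]:
    """Hypothetical helper: the same metric on each dataset, plus the gap."""
    mse1 = mean_squared_error(y1, pred1)
    mse2 = mean_squared_error(y2, pred2)
    return {"dataset1": mse1, "dataset2": mse2, "difference": mse2 - mse1}


report = mse_drift([1.0, 2.0], [1.0, 3.0], [1.0, 2.0], [3.0, 4.0])
print(report)  # prints: {'dataset1': 0.5, 'dataset2': 4.0, 'difference': 3.5}
```

Computing such metrics requires labels for both datasets, so y1 and y2 would need to be passed to fit.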

get_prediction_drift(prediction_type: str = 'raw') → List[DriftMetricsNum]

Compute drift measures based on model predictions.

See the README, especially the slide presentation, for an explanation of how it is computed.

Parameters

prediction_type : str, optional (default=“raw”)

Type of predictions to consider. Choose among:

- “raw”: logit predictions (binary classification), log-softmax predictions (multiclass classification), regular predictions (regression)
- “proba”: predicted probabilities (only for classification models)
- “class”: predicted classes (only for classification models)

Returns

prediction_drift : list of DriftMetricsNum objects

Drift measures for each predicted dimension.
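The three prediction types are related by standard transformations, sketched below with the standard library only. The function names and the 0.5 decision threshold are assumptions made for this example; the sketch shows how raw scores map to probabilities and classes in general, not how cinnamon computes them internally.

```python
import math
from typing import Dict, List


def sigmoid(logit: float) -> float:
    """Map a binary-classification logit to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


def binary_views(logit: float, threshold: float = 0.5) -> Dict[str, float]:
    """Hypothetical helper: "raw", "proba" and "class" views of one binary prediction."""
    proba = sigmoid(logit)
    # The 0.5 threshold is an assumption of this sketch.
    return {"raw": logit, "proba": proba, "class": int(proba >= threshold)}


def multiclass_views(log_softmax: List[float]) -> Dict[str, object]:
    """Hypothetical helper: the same three views for one multiclass prediction."""
    # Exponentiating log-softmax scores recovers the class probabilities.
    proba = [math.exp(v) for v in log_softmax]
    predicted = max(range(len(proba)), key=proba.__getitem__)
    return {"raw": log_softmax, "proba": proba, "class": predicted}


binary = binary_views(0.0)
multi = multiclass_views([math.log(0.2), math.log(0.5), math.log(0.3)])
print(binary["proba"], binary["class"], multi["class"])  # prints: 0.5 1 1
```

Drift measured on “raw” predictions is sensitive to shifts that “class” predictions hide, since many different scores collapse to the same predicted class.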

get_tree_based_correction_weights(max_depth: Optional[int] = None, max_ratio: int = 10) → array

This is not the recommended way to compute correction weights for data drift (research purposes only); AdversarialDriftExplainer should be preferred. The approach reuses ideas from get_tree_based_drift_importances to estimate correction weights, but first experiments show that it performs poorly.

Parameters

max_depth : int, optional (default=None)

Depth at which the ratios of node weights are computed. If None, the ratios are computed in the terminal leaves.

max_ratio : int, optional (default=10)

Maximum ratio between two weights returned in correction_weights (weights are thresholded so that the ratio between any two weights does not exceed max_ratio).

Returns

correction_weights : np.array

Array of correction weights for the samples of dataset 1.
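The max_ratio constraint can be illustrated with one possible thresholding scheme. The helper clip_weight_ratio is hypothetical, and clipping symmetrically around the geometric mean is an assumption of this sketch; the thresholding actually used by get_tree_based_correction_weights may differ.

```python
import math
from typing import List, Sequence


def clip_weight_ratio(weights: Sequence[float], max_ratio: float = 10.0) -> List[float]:
    """Clip positive weights so that max(w) / min(w) <= max_ratio.

    Hypothetical helper: weights are clipped in log-space to a window of
    width log(max_ratio) centred on their geometric mean.
    """
    if max_ratio < 1:
        raise ValueError("max_ratio must be at least 1")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be strictly positive")
    log_w = [math.log(w) for w in weights]
    center = sum(log_w) / len(log_w)        # log of the geometric mean
    half_width = 0.5 * math.log(max_ratio)  # half of the allowed log-range
    low, high = center - half_width, center + half_width
    return [math.exp(min(max(lw, low), high)) for lw in log_w]


clipped = clip_weight_ratio([0.01, 1.0, 100.0], max_ratio=10)
ratio = max(clipped) / min(clipped)  # about 10, up to floating-point rounding
```

Bounding the ratio limits the influence of any single sample, at the cost of a less exact correction when the two datasets differ strongly.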

get_tree_based_drift_importances(type: str = 'mean') → array

Compute drift values using the tree structures present in the model.

See the README, especially the slide presentation, for an explanation of how it is computed.

Parameters

type : str, optional (default=“mean”)

Method used to compute the drift values. Choose among:

- “node_size” (recommended)
- “mean”
- “mean_norm”

See details in slide presentation.

Returns

drift_importances : numpy array