cinnamon.drift.model_drift_explainer.ModelDriftExplainer
- class cinnamon.drift.model_drift_explainer.ModelDriftExplainer(model, iteration_range: Optional[Tuple[int, int]] = None, task: Optional[str] = None)
Study data drift through the lens of an ML model or ML pipeline.
Parameters
- model: an ML model or ML pipeline (see the “Supported Model” section).
The model used to make predictions.
- iteration_range: Tuple[int, int], optional (default=None)
Only for tree-based models. Specifies which layers of trees are used. For example, if XGBoost is trained with 100 rounds and iteration_range=(10, 20), only the trees built during iterations [10, 20) are used. If None, all trees are used.
- task: string, optional (default=None)
Task corresponding to the (X, Y) data. Either “regression”, “classification”, or “ranking”. “task” is a mandatory parameter if the model is treated as a black box predictor.
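The half-open convention for iteration_range can be illustrated with a small sketch. This is a toy stand-in, not the library's code: `tree_outputs` and `select_trees` are hypothetical names, and the "trees" are plain numbers rather than real booster trees.

```python
# Toy stand-in for a boosted ensemble trained with 100 rounds.
# Each entry represents the output of one tree (hypothetical values).
tree_outputs = [0.01 * i for i in range(100)]


def select_trees(outputs, iteration_range=None):
    """Return the per-tree outputs kept by a half-open [start, end) range."""
    if iteration_range is None:
        return list(outputs)  # None means every tree is used
    start, end = iteration_range
    return list(outputs[start:end])  # end is excluded, as in [10, 20)


kept = select_trees(tree_outputs, iteration_range=(10, 20))
all_trees = select_trees(tree_outputs)
```

With iteration_range=(10, 20), `kept` holds the ten trees at indices 10 through 19; index 20 is excluded.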
Attributes
- predictions1: numpy array
Array of predictions of “model” on the X1 dataset. For classification, corresponds to raw (logit or log-softmax) predictions.
- predictions2: numpy array
Array of predictions of “model” on the X2 dataset. For classification, corresponds to raw (logit or log-softmax) predictions.
- pred_proba1: numpy array
Array of predicted probabilities of “model” on X1 (equal to None if task is regression or ranking).
- pred_proba2: numpy array
Array of predicted probabilities of “model” on X2 (equal to None if task is regression or ranking).
- iteration_range: tuple of integers
Layer of trees used.
- feature_drifts: list of Union[DriftMetricsCat, DriftMetricsNum]
Drift measures for each input feature in X.
- target_drift: Union[DriftMetricsCat, DriftMetricsNum]
Drift measures for the labels y.
- n_features: int
Number of features in input X.
- feature_names: list of string
Feature names for input X.
- class_names: list of string
Class names of the target when task is “classification”. Otherwise equal to None.
- cat_feature_indices: list of int
Indexes of categorical features in input X.
- X1, X2: pandas dataframes
X1 and X2 inputs passed to the “fit” method.
- y1, y2: numpy arrays
y1 and y2 targets passed to the “fit” method.
- sample_weights1, sample_weights2: numpy arrays
sample_weights1 and sample_weights2 arrays passed to the “fit” method.
- fit(X1: DataFrame, X2: DataFrame, y1: Optional[array] = None, y2: Optional[array] = None, sample_weights1: Optional[array] = None, sample_weights2: Optional[array] = None, cat_feature_indices: Optional[List[int]] = None)
Fit the model drift explainer to dataset 1 and dataset 2.
Parameters
- X1: pandas dataframe of shape (n_samples, n_features)
Dataset 1 inputs.
- X2: pandas dataframe of shape (n_samples, n_features)
Dataset 2 inputs.
- y1: numpy array of shape (n_samples,), optional (default=None)
Dataset 1 labels. If None, data drift is only analyzed based on inputs X1 and X2.
- y2: numpy array of shape (n_samples,), optional (default=None)
Dataset 2 labels. If None, data drift is only analyzed based on inputs X1 and X2.
- sample_weights1: numpy array of shape (n_samples,), optional (default=None)
Array of weights assigned to individual samples of dataset 1. If None, each sample of dataset 1 is given unit weight.
- sample_weights2: numpy array of shape (n_samples,), optional (default=None)
Array of weights assigned to individual samples of dataset 2. If None, each sample of dataset 2 is given unit weight.
- cat_feature_indices: list of int, optional (default=None)
Indexes of categorical features in input X.
Returns
- ModelDriftExplainer
The fitted model drift explainer.
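The unit-weight default for the sample weights can be sketched as follows. `resolve_weights` is a hypothetical helper written for illustration, not a function of the library; it only shows that passing None is equivalent to weighting every sample by 1.

```python
import numpy as np


def resolve_weights(sample_weights, n_samples):
    """Hypothetical helper: None becomes unit weight for every sample."""
    if sample_weights is None:
        return np.ones(n_samples)
    weights = np.asarray(sample_weights, dtype=float)
    if weights.shape != (n_samples,):
        raise ValueError("sample_weights must have shape (n_samples,)")
    return weights


y1 = np.array([1.0, 2.0, 3.0, 4.0])
unit = resolve_weights(None, len(y1))
custom = resolve_weights([1.0, 1.0, 1.0, 5.0], len(y1))

mean_unit = np.average(y1, weights=unit)      # identical to y1.mean()
mean_custom = np.average(y1, weights=custom)  # pulled toward the last sample
```

Any weighted statistic computed with unit weights reduces to its unweighted counterpart, which is why omitting the weights is a safe default.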
- get_model_agnostic_drift_importances(type: str = 'mean', prediction_type: str = 'raw', max_ratio: float = 10, max_n_cat: int = 20) -> array
Compute drift importances using the model agnostic method.
See the documentation in README for explanations about how it is computed, especially the slide presentation.
Parameters
- type: str, optional (default=“mean”)
Method used for drift importances computation. Choose among “mean” and “wasserstein”.
See details in the slide presentation.
- prediction_type: str, optional (default=“raw”)
Choose among “raw”, “proba” (predicted probability, if task == ‘classification’), and “class” (predicted class, if task == ‘classification’).
- max_ratio: float, optional (default=10)
Only used for categorical features.
- max_n_cat: int, optional (default=20)
Only used for categorical features.
Returns
- drift_importances: numpy array
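The exact computation is described in the slide presentation and is not reproduced here. As a rough intuition for the two option names, the sketch below compares two samples of predictions in two ways, assuming that “mean” relates to a difference of mean predictions and “wasserstein” to a Wasserstein distance between prediction distributions. Both functions are hypothetical illustrations, and the Wasserstein version only handles unweighted samples of equal size.

```python
import numpy as np


def mean_shift(pred1, pred2):
    """Difference between the mean predictions of the two samples."""
    return float(np.mean(pred2) - np.mean(pred1))


def wasserstein_1d(pred1, pred2):
    """1-Wasserstein distance between two equal-size, unweighted 1D samples."""
    a, b = np.sort(pred1), np.sort(pred2)
    if a.shape != b.shape:
        raise ValueError("this sketch only handles samples of equal size")
    # For equal-size samples, the optimal transport plan matches sorted values.
    return float(np.mean(np.abs(a - b)))


pred1 = np.array([0.0, 2.0])
pred2 = np.array([1.0, 1.0])
shift = mean_shift(pred1, pred2)
dist = wasserstein_1d(pred1, pred2)
```

In this toy case the two samples have the same mean, so `shift` is zero, while `dist` is positive because the shapes of the two distributions differ. A distance on distributions can therefore detect drift that a comparison of means misses.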
- get_performance_metrics_drift() -> PerformanceMetricsDrift
Compute performance metrics on dataset 1 and dataset 2.
Returns
- performance_metrics_drift: PerformanceMetricsDrift object
Comparison of either RegressionMetrics or ClassificationMetrics objects.
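The idea of comparing performance across the two datasets can be sketched for a regression task. The fields of PerformanceMetricsDrift are not listed here, so the example below uses a hypothetical `weighted_mse` helper and toy arrays rather than the library's objects.

```python
import numpy as np


def weighted_mse(y_true, y_pred, sample_weights=None):
    """Mean squared error, optionally weighted per sample."""
    errors = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
    return float(np.average(errors, weights=sample_weights))


# Toy labels and predictions for dataset 1 and dataset 2.
y1, pred1 = np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 4.0])
y2, pred2 = np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 5.0])

mse1 = weighted_mse(y1, pred1)
mse2 = weighted_mse(y2, pred2)
mse_drift = mse2 - mse1  # positive: the model performs worse on dataset 2
```

Note that a comparison of this kind needs labels for both datasets, so y1 and y2 must have been passed to “fit”.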
- get_prediction_drift(prediction_type: str = 'raw') -> List[DriftMetricsNum]
Compute drift measures based on model predictions.
Parameters
- prediction_type: str, optional (default=“raw”)
Type of predictions to consider. Choose among: “raw” (logit predictions for binary classification, log-softmax predictions for multiclass classification, regular predictions for regression); “proba” (predicted probabilities, only for classification models); “class” (predicted classes, only for classification models).
Returns
- prediction_drift: list of DriftMetricsNum or DriftMetricsCat objects
Drift measures for each predicted dimension.
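The three prediction types are related by standard transformations, sketched below with NumPy. This illustrates the math only and makes no claim about how the library computes them; in particular, taking the class as the argmax of the probabilities is an assumption.

```python
import numpy as np


def sigmoid(logit):
    """Map binary-classification logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-logit))


def log_softmax(scores):
    """Numerically stable log-softmax over the class axis."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


# Binary classification: "raw" is a logit, "proba" its sigmoid.
binary_raw = np.array([-2.0, 0.0, 2.0])
binary_proba = sigmoid(binary_raw)

# Multiclass: "raw" is a log-softmax vector, "proba" its exponential.
multi_scores = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
multi_raw = log_softmax(multi_scores)
multi_proba = np.exp(multi_raw)
multi_class = multi_proba.argmax(axis=1)  # assumed rule for "class"
```

Raw predictions are unbounded while probabilities are squeezed into [0, 1], so drift measured on one scale can look quite different from drift measured on the other.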
- get_tree_based_correction_weights(max_depth: Optional[int] = None, max_ratio: int = 10) -> array
Computes correction weights for data drift. This method is not recommended and is provided for research purposes only; AdversarialDriftExplainer should be preferred for this purpose. The approach uses ideas similar to those in get_tree_based_drift_importances to estimate correction weights, but first experiments show that it performs poorly.
Parameters
- max_depth: int, optional (default=None)
Depth at which the ratios of node weights are computed. If None, the ratios are computed in the terminal leaves.
- max_ratio: int, optional (default=10)
Maximum ratio between two weights returned in correction_weights (weights are thresholded so that the ratio between any two weights does not exceed max_ratio).
Returns
- correction_weights: np.array
Array of correction weights for the samples of dataset 1.
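The effect of max_ratio can be illustrated with one possible thresholding scheme. The clipping rule below (keeping weights within [1/sqrt(max_ratio), sqrt(max_ratio)]) is an assumption chosen for illustration, and `threshold_weights` is a hypothetical function; the library may threshold differently.

```python
import numpy as np


def threshold_weights(weights, max_ratio=10):
    """Hypothetical clipping: keep weights within [1/sqrt(r), sqrt(r)]."""
    bound = np.sqrt(max_ratio)
    return np.clip(np.asarray(weights, dtype=float), 1.0 / bound, bound)


raw_weights = np.array([0.01, 0.5, 1.0, 4.0, 100.0])
clipped = threshold_weights(raw_weights, max_ratio=10)
ratio = clipped.max() / clipped.min()  # bounded by max_ratio (up to rounding)
```

Bounding the ratio keeps a handful of samples from dominating any statistic that is later computed with the correction weights.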
- get_tree_based_drift_importances(type: str = 'mean') -> array
Compute drift importances using the tree structure of the model.
See the documentation in README for explanations about how it is computed, especially the slide presentation.
Parameters
- type: str, optional (default=“mean”)
Method used for drift importances computation. Choose among “node_size”, “mean”, and “mean_norm”.
See details in the slide presentation.
Returns
- drift_importances: numpy array