CFXplorer: A Python Package for Generating Counterfactual Explanations

磐创AI

Published 2024-06-07 14:40:48

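As machine learning models are deployed more widely in real-world settings, model explainability is becoming increasingly important. Understanding how a model reaches its decisions benefits not only the model's users but also the people affected by those decisions. Counterfactual explanations were developed to address this need: they let individuals see how perturbing the original data would lead to a desired outcome. In short, counterfactual explanations can give actionable advice to people affected by a model's decisions. For example, someone whose loan application was rejected can learn what would have needed to be different for it to be accepted, which is useful when preparing the next application.

Lucic et al. [1] proposed FOCUS, which aims to generate optimal-distance counterfactual explanations to the original data for all instances in tree-based machine learning models.

CFXplorer is a Python package that uses the FOCUS algorithm to generate counterfactual explanations for a given model and data. This article introduces CFXplorer and shows how to use it to generate counterfactual explanations.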

Links

GitHub repository: https://github.com/kyosek/CFXplorer

Documentation: https://cfxplorer.readthedocs.io/en/latest/?badge=latest

PyPI: https://pypi.org/project/CFXplorer/

1. FOCUS Algorithm

This section briefly introduces the FOCUS algorithm.

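Generating counterfactual explanations is a problem that several existing methods address. Wachter, Mittelstadt, and Russell [2] cast it as an optimization problem, but their approach is limited to differentiable models. FOCUS extends this framework to non-differentiable models, in particular tree-based algorithms, by introducing a probabilistic approximation of the model. The key step is to approximate the pretrained tree-based model, denoted f, by replacing every split in every tree with a sigmoid function with parameter σ. In the notation of [1]:

$$\mathrm{sig}(z) = \frac{1}{1 + \exp(-\sigma z)} \tag{1}$$

where σ ∈ R>0.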
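To see why a sigmoid can stand in for a tree split, the sketch below compares it with the hard step function a split actually applies. It assumes the scaled form sig(z) = 1 / (1 + exp(-σz)) and treats z as the signed distance between a feature value and a split threshold; the function names and sample values are illustrative only.

```python
import numpy as np


def sig(z, sigma=10.0):
    """Scaled sigmoid 1 / (1 + exp(-sigma * z)); sigma > 0 sets the steepness."""
    return 1.0 / (1.0 + np.exp(-sigma * np.asarray(z, dtype=float)))


def hard_split(z):
    """Indicator 1[z > 0], the step function used by an ordinary tree split."""
    return (np.asarray(z) > 0).astype(float)


# z stands for (feature value - threshold); the values are arbitrary.
z = np.array([-0.5, -0.1, 0.1, 0.5])
for sigma in (1.0, 10.0, 100.0):
    gap = np.max(np.abs(sig(z, sigma) - hard_split(z)))
    print(f"sigma={sigma:>5}: max gap to hard split = {gap:.4f}")
```

Larger σ values track the hard split more closely, at the price of smaller gradients away from the threshold, which is one reason σ is worth tuning.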

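This sigmoid is built into a function t̃_j(x) that approximates t_j(x), the activation of node j of f for a given input x. In the notation of [1]:

$$\tilde{t}_j(x) = \begin{cases} 1 & \text{if } j \text{ is the root} \\ \tilde{t}_{p_j}(x)\,\mathrm{sig}(x_{f_j} - \theta_j) & \text{if } j \text{ is the child taken when } x_{f_j} > \theta_j \\ \tilde{t}_{p_j}(x)\,\mathrm{sig}(\theta_j - x_{f_j}) & \text{if } j \text{ is the child taken when } x_{f_j} \le \theta_j \end{cases} \tag{2}$$

where θ_j is the threshold for the activation of node j, f_j is the feature compared against that threshold, and p_j is the parent of node j.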

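This gives an approximation of a single decision tree T. Writing T_leaf for the set of leaf nodes of T and T(y | j) for the value the tree assigns to class y at leaf j, the approximated tree can be defined as:

$$\tilde{T}(y \mid x) = \sum_{j \in T_{\text{leaf}}} \tilde{t}_j(x)\, T(y \mid j) \tag{3}$$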
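The single-tree approximation can be sketched on top of a fitted scikit-learn tree. This is an illustration under stated assumptions, not CFXplorer's implementation: `soft_tree_proba` is a hypothetical helper, it reads the public `tree_` arrays of `DecisionTreeClassifier`, and it follows scikit-learn's convention that a sample goes to the left child when its feature value is less than or equal to the threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def soft_tree_proba(model, x, sigma=10.0):
    """Sigmoid-smoothed class scores of a fitted decision tree for one sample x."""
    tree = model.tree_
    activation = np.zeros(tree.node_count)
    activation[0] = 1.0  # the root is always active
    scores = np.zeros(tree.value.shape[2])
    # Node ids are assigned parent-first, so a plain loop visits parents before children.
    for j in range(tree.node_count):
        left, right = tree.children_left[j], tree.children_right[j]
        if left == -1:  # leaf: add its activation times its class distribution
            leaf = tree.value[j, 0]
            scores += activation[j] * leaf / leaf.sum()
            continue
        z = x[tree.feature[j]] - tree.threshold[j]
        go_right = 1.0 / (1.0 + np.exp(-sigma * z))  # soft version of x > threshold
        activation[left] = activation[j] * (1.0 - go_right)
        activation[right] = activation[j] * go_right
    return scores


X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # scale to [0, 1]
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

soft = soft_tree_proba(model, X[0], sigma=50.0)
print("soft scores:", soft)
print("hard proba: ", model.predict_proba(X[:1])[0])
```

Because the two sigmoid branches at each split sum to one, the leaf activations always sum to one, so the soft scores form a valid distribution over classes.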

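In addition, the method replaces the maximum operation of f over its M trees with weights ω_m by a softmax function with temperature τ ∈ R>0. The approximation f̃ can therefore be written as:

$$\tilde{f}(y \mid x) = \frac{\exp\left(\tau \sum_{m=1}^{M} \omega_m \tilde{T}_m(y \mid x)\right)}{\sum_{y'} \exp\left(\tau \sum_{m=1}^{M} \omega_m \tilde{T}_m(y' \mid x)\right)} \tag{4}$$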
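The effect of the temperature can be illustrated with a few lines of NumPy. The sketch below assumes per-tree class scores are already available as an array; `soft_ensemble`, the scores, and the weights are made up for illustration.

```python
import numpy as np


def soft_ensemble(tree_scores, weights, tau=1.0):
    """Softmax with temperature tau over the weighted sum of per-tree class scores.

    tree_scores has shape (M, n_classes); weights has shape (M,).
    """
    combined = np.asarray(weights) @ np.asarray(tree_scores)
    logits = tau * combined
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()


# Three hypothetical trees voting over two classes, equally weighted.
tree_scores = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
weights = np.array([1 / 3, 1 / 3, 1 / 3])

for tau in (1.0, 10.0, 100.0):
    print(f"tau={tau:>5}: {soft_ensemble(tree_scores, weights, tau)}")
```

As τ grows, the output concentrates on the class with the largest weighted score, which is how the softmax recovers the behaviour of the original maximum operation.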

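It is worth noting that this approximation can be applied to any tree-based model.

The main claims of the FOCUS algorithm are that the method (i) generates counterfactual explanations for all instances in a dataset and (ii) for tree-based algorithms, finds counterfactual explanations that are closer to the original input than those of existing frameworks.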

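2. CFXplorer Examples

This section walks through two examples of using the CFXplorer package. The first is a simple example that covers basic usage. The second shows how to search for the best FOCUS hyperparameters with the Optuna [3] package. As described in the previous section, FOCUS has several hyperparameters, and these can be optimized by combining it with a hyperparameter tuning package.

2.1. Simple Example

In this simple example, we create random data and a decision tree model, then use CFXplorer to generate counterfactual explanations.

Installation

You can install the package with pip: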

pip install CFXplorer

First, import the packages.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from cfxplorer import Focus
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

We create a dummy dataset for the decision tree model to use.

def generate_example_data(rows: int = 1000):
    """
    Generate random data with a binary target variable and 10 features.

    Args:
        rows (int): The number of rows in the generated dataset.

    Returns:
        tuple: X_train, X_test, y_train, y_test as NumPy arrays (80/20 train/test split).

    """
    X, y = make_classification(
        n_samples=rows, n_features=10, n_classes=2, random_state=42
    )

    return train_test_split(X, y, test_size=0.2, random_state=42)

CFXplorer only accepts feature values scaled to between 0 and 1, so we need to scale them.

def standardize_features(x_train, x_test):
    """
    Standardizes the features of the input data using Min-Max scaling.

    Args:
        x_train (pandas.DataFrame or numpy.ndarray): The training data.
        x_test (pandas.DataFrame or numpy.ndarray): The test data.

    Returns:
        tuple: A tuple containing two pandas DataFrames.
            - The first DataFrame contains the standardized features of the training data.
            - The second DataFrame contains the standardized features of the test data.
    """
    # Create a MinMaxScaler object
    scaler = MinMaxScaler(feature_range=(0, 1))

    # Fit and transform the data to perform feature scaling
    scaler = scaler.fit(x_train)
    scaled_x_train = scaler.transform(x_train)
    scaled_x_test = scaler.transform(x_test)

    # Create a new DataFrame with standardized features
    standardized_train = pd.DataFrame(scaled_x_train)
    standardized_test = pd.DataFrame(scaled_x_test)

    return standardized_train, standardized_test
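One detail worth checking: a scaler fitted on the training data can map test values outside the 0 to 1 range when the test set contains values beyond the training minimum or maximum. The sketch below uses synthetic data to show the effect and one possible remedy, the `clip=True` option of scikit-learn's `MinMaxScaler`. Whether clipping is appropriate for your data is a judgment call; it is not something the article's `standardize_features` does.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 3))
x_test = rng.normal(size=(50, 3)) * 2.0  # deliberately wider spread than the training data

# Fitted on train only, so test values can land outside [0, 1].
plain = MinMaxScaler().fit(x_train).transform(x_test)

# clip=True clips transformed values to the feature range.
clipped = MinMaxScaler(clip=True).fit(x_train).transform(x_test)

print("without clip:", plain.min(), plain.max())
print("with clip:   ", clipped.min(), clipped.max())
```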

Now train the decision tree model.

def train_decision_tree_model(X_train, y_train):
    """
    Train a decision tree model using scikit-learn.

    Args:
        X_train (array-like or sparse matrix of shape (n_samples, n_features)): The training input samples.
        y_train (array-like of shape (n_samples,)): The target values for training.

    Returns:
        sklearn.tree.DecisionTreeClassifier: The trained decision tree model.

    """
    # Create and train the decision tree model
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)

    return model

Put all of the steps above together and run them.

X_train, X_test, y_train, y_test = generate_example_data(1000)
X_train, X_test = standardize_features(X_train, X_test)
model = train_decision_tree_model(X_train, y_train)

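Once we have the data and the model, we initialize Focus. Focus accepts several arguments for customization, but for simplicity this example sets only the number of iterations and the distance function.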

focus = Focus(
    num_iter=1000,
    distance_function="cosine",
)

The full set of Focus parameters is:

distance_function: str, optional (default="euclidean")
    Distance function, one of the following:
        - "euclidean"
        - "cosine"
        - "l1"
        - "mahalanobis"

optimizer: Keras optimizer, optional (default=tf.keras.optimizers.Adam())
    Optimizer for gradient descent

sigma: float, optional (default=10.0)
    Sigma hyperparameter value for hinge loss

temperature: float, optional (default=1.0)
    Temperature hyperparameter value for hinge loss

distance_weight: float, optional (default=0.01)
    Weight hyperparameter for distance loss

lr: float, optional (default=0.001)
    Learning rate for gradient descent optimization

num_iter: int, optional (default=100)
    Number of iterations for gradient descent optimization

direction: str, optional (default="both")
    Direction of perturbation (e.g. both, positive and negative)

hyperparameter_tuning: bool, optional (default=False)
    if True, generate method returns unchanged_ever and mean_distance

verbose: int, optional (default=1)
    Verbosity mode.
        - 0: silent
        - else: print current number of iterations

Finally, we can generate counterfactual explanations with the generate method of Focus.

perturbed_feats = focus.generate(model, X_test, X_train)
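Before plotting, it can help to check how many predictions actually changed. The helper below is a hypothetical sketch, not part of CFXplorer. It assumes the array returned by `generate` has the same shape and row order as the input features, and it measures the Euclidean distance between each original row and its perturbed counterpart.

```python
import numpy as np


def summarize_counterfactuals(model, X, X_cf):
    """Count flipped predictions and average the distance moved by flipped rows."""
    X = np.asarray(X, dtype=float)
    X_cf = np.asarray(X_cf, dtype=float)
    flipped = model.predict(X) != model.predict(X_cf)
    distances = np.linalg.norm(X_cf - X, axis=1)
    mean_distance = float(distances[flipped].mean()) if flipped.any() else float("nan")
    return {
        "n_flipped": int(flipped.sum()),
        "n_unchanged": int((~flipped).sum()),
        "mean_distance_flipped": mean_distance,
    }


# Usage with the objects from this section (not executed here):
# print(summarize_counterfactuals(model, X_test, perturbed_feats))
```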

We can inspect the generated counterfactual explanations in a plot.
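The plotting code that follows calls a helper named `prepare_plot_df`, which is not defined in this article. A minimal sketch of what such a helper could look like is given here as an assumption: it fits a two-component PCA on the original features, projects both the original and the perturbed features with it, and attaches the model's predictions, using the column names `pca1`, `pca2`, and `predictions` that `plot_pca` expects.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def prepare_plot_df(model, X, X_focus):
    """Build the two DataFrames used by plot_pca (hypothetical reconstruction)."""
    X = np.asarray(X)
    X_focus = np.asarray(X_focus)
    # Fit PCA on the original features only, so both sets share one projection.
    pca = PCA(n_components=2).fit(X)

    def to_frame(features):
        frame = pd.DataFrame(pca.transform(features), columns=["pca1", "pca2"])
        frame["predictions"] = model.predict(features)
        return frame

    return to_frame(X), to_frame(X_focus)
```

Fitting the projection once keeps the before and after panels on the same axes, so any movement in the plot reflects the perturbation rather than a change of basis.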

def plot_pca(plot_df, focus_plot_df):
    """
    Plots the PCA-transformed features and corresponding predictions before and after applying FOCUS.

    Args:
        plot_df (pandas.DataFrame): A DataFrame containing the PCA-transformed features and
            predictions before applying FOCUS.
        focus_plot_df (pandas.DataFrame): A DataFrame containing the PCA-transformed features and
            predictions after applying FOCUS.

    Returns:
        None: This function displays the plot but does not return any value.
    """
    fig, axes = plt.subplots(1, 2, figsize=(20, 8))
    sns.scatterplot(
        data=focus_plot_df, x="pca1", y="pca2", hue="predictions", ax=axes[0]
    )
    axes[0].set_title("After applying FOCUS")
    sns.scatterplot(data=plot_df, x="pca1", y="pca2", hue="predictions", ax=axes[1])
    axes[1].set_title("Before applying FOCUS")
    fig.suptitle("Prediction Before and After FOCUS comparison")
    plt.show()


plot_df, focus_plot_df = prepare_plot_df(model, X_test, perturbed_feats)
plot_pca(plot_df, focus_plot_df)

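The resulting plot shows the test set in PCA space before and after applying FOCUS.

Before applying FOCUS, many of the points predicted as 1 sit on the right-hand side; after applying it, they are predicted as 0. Likewise, the points predicted as 0 before FOCUS sit on the left-hand side and are predicted as 1 afterwards.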

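2.2. Hyperparameter Optimization

FOCUS has four main hyperparameters: sigma (Equation 1), temperature (Equation 4), distance weight (the trade-off parameter between the distance loss and the prediction loss), and the learning rate of Adam [4].

Note 1: This example uses a decision tree model, so the temperature hyperparameter is not used.

Note 2: The optimization algorithm itself (Adam here) could also be treated as a hyperparameter, but for simplicity it is not tuned in this section. The same goes for Adam's other hyperparameters apart from the learning rate.

This section uses Optuna to optimize the FOCUS hyperparameters. Optuna is a powerful hyperparameter optimization tool that performs Bayesian optimization. Apart from Optuna, we reuse the functions created above: generate_example_data, standardize_features, and train_decision_tree_model.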

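The objective function is shown below. It defines which hyperparameters to tune and what to optimize. In this example we tune three hyperparameters of the Focus class: sigma, distance weight, and the learning rate of the Adam optimizer. The search space of each is defined with trial.suggest_float or trial.suggest_int. The loss is defined as cfe_distance / 100 + pow(unchanged_ever, 2). As the function's docstring explains, we want to prioritize finding counterfactual explanations over minimizing the mean distance, so the number of unchanged instances is squared.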
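A quick worked example shows how this loss ranks two outcomes. The input values are hypothetical and chosen only to illustrate the trade-off; `tuning_loss` simply restates the expression above.

```python
def tuning_loss(cfe_distance, unchanged_ever):
    """Loss from the text: mean distance / 100 plus the squared unchanged count."""
    return cfe_distance / 100 + pow(unchanged_ever, 2)


# Trial A: small mean distance, but one instance never flipped.
loss_a = tuning_loss(cfe_distance=0.2, unchanged_ever=1)

# Trial B: much larger mean distance, but every instance flipped.
loss_b = tuning_loss(cfe_distance=5.0, unchanged_ever=0)

print(f"trial A: {loss_a:.3f}")
print(f"trial B: {loss_b:.3f}")
```

Trial B wins even though its counterfactuals are 25 times farther away, because a single unchanged instance adds 1 to the loss while the distance term is scaled down by 100.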

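Note: It is important to set the hyperparameter_tuning argument of the Focus class to True. Otherwise, generate does not return the number of unchanged instances or the mean counterfactual explanation distance.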

import optuna
import tensorflow as tf

from cfxplorer import Focus


def objective(trial):
    """
    This function is an objective function for
    hyperparameter tuning using optuna.
    It explores the hyperparameter sets and evaluates the result on a
    given model and dataset

    Mean distance and number of unchanged instances are
    used for the evaluation.

    Args:
        trial (optuna.Trial): Object that contains information about
            the current trial, including hyperparameters.

    Returns:
        float: Mean CFE distance / 100 + number of unchanged instances
            squared. This is the objective for hyperparameter optimization.

    * Note: typically we want to minimise the number of unchanged
        instances first, so the score is penalised by squaring that number.
        To not distort this objective, the mean distance is divided by 100.
    """
    X_train, X_test, y_train, y_test = generate_example_data(1000)
    X_train, X_test = standardize_features(X_train, X_test)
    model = train_decision_tree_model(X_train, y_train)

    focus = Focus(
        num_iter=1000,
        distance_function="euclidean",
        sigma=trial.suggest_int("sigma", 1, 20, step=1),
        temperature=0,  # DT models do not use temperature
        distance_weight=round(
            trial.suggest_float("distance_weight", 0.01, 0.1, step=0.01), 2
        ),
        lr=round(trial.suggest_float("lr", 0.001, 0.01, step=0.001), 3),
        optimizer=tf.keras.optimizers.Adam(),
        hyperparameter_tuning=True,
        verbose=0,
    )

    best_perturb, unchanged_ever, cfe_distance = focus.generate(model, X_test)

    print(f"Unchanged: {unchanged_ever}")
    print(f"Mean distance: {cfe_distance}")

    return cfe_distance / 100 + pow(unchanged_ever, 2)

Once the objective function is defined, we can start tuning the hyperparameters.

if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=100)

    print(f"Number of finished trials: {len(study.trials)}")

    trial = study.best_trial

    print("Best trial:")
    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

More detailed examples can be found in the package repository.

https://github.com/kyosek/CFXplorer/tree/master/examples

3. Limitations

The Focus class has some limitations. Here are a few of them:

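Currently, the Focus class only supports scikit-learn's DecisionTreeClassifier, RandomForestClassifier, and AdaBoostClassifier.
The feature set may contain categorical features, but take care when interpreting suggested changes to them; a change such as age going from 40 to 20 may not provide meaningful insight.
Input features should be scaled to the range 0 to 1 before applying Focus, so the features need to be transformed first. This scaling can make the features harder to interpret after Focus has been applied.
Computational cost grows with the size of the given model. With a very large model, it may not be feasible to run the code.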
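One way to ease the interpretation issue caused by scaling is to map counterfactuals back to the original units with the fitted scaler. The sketch below assumes you kept the `MinMaxScaler` object, which the `standardize_features` function in this article does not return. The feature values and the perturbation are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw features: age in years and income in currency units.
raw = np.array([[20.0, 30000.0], [40.0, 52000.0], [60.0, 90000.0]])
scaler = MinMaxScaler().fit(raw)

scaled_original = scaler.transform(raw[1:2])
# Hypothetical counterfactual: the second scaled feature moved up by 0.1.
scaled_cf = scaled_original + np.array([[0.0, 0.1]])

# Convert both back to raw units and report the change per feature.
delta = scaler.inverse_transform(scaled_cf) - scaler.inverse_transform(scaled_original)
print("change in original units:", delta)
```

Reporting the change in raw units turns an abstract shift of 0.1 into a concrete difference in income, which is easier to act on.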

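4. Conclusion

The CFXplorer Python package provides an implementation of the FOCUS algorithm for generating optimal-distance counterfactual explanations for a given tree-based model. Despite some limitations, the package should be useful to anyone who wants to explore counterfactual outcomes for tree-based models.

This article covered the theoretical background of the FOCUS algorithm, code examples showing how to use CFXplorer, and some of its current limitations. In the future, I will add more counterfactual explanation methods to the package.

I hope you find this article useful.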

5. References

[1] A. Lucic, H. Oosterhuis, H. Haned, and M. de Rijke. “FOCUS: Flexible optimizable counterfactual explanations for tree ensembles.” In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. 5. 2022, pp. 5313–5322.

[2] S. Wachter, B. Mittelstadt, and C. Russell. “Counterfactual explanations without opening the black box: Automated decisions and the GDPR.” In: Harv. JL & Tech. 31 (2017), p. 841.

[3] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. “Optuna: A next-generation hyperparameter optimization framework.” In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 2019, pp. 2623–2631.

[4] D. P. Kingma and J. Ba. “Adam: A method for stochastic optimization.” In: arXiv preprint arXiv:1412.6980 (2014).

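This article is part of the Tencent Cloud self-media syndication program and was shared from a WeChat official account.

Originally published: 2024-06-06. In case of infringement, contact cloudcommunity@tencent.com for removal.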