I am trying to run GridSearchCV to tune the hyperparameters of a classifier, and the scoring should be done with a custom scoring function. The problem is that the score is based on a cost that differs per instance (the cost is also one of the instance's features). As the example below shows, the scoring function therefore needs a third array, test_amt, holding the cost of each instance, in addition to the "normal" y and y_pred arguments.
def calculate_costs(y_test, y_test_pred, test_amt):
    cost = 0
    for i in range(len(y_test)):  # start at 0 so the first instance is not skipped
        y = y_test.iloc[i]
        y_pred = y_test_pred.iloc[i]
        x_amt = test_amt.iloc[i]
        if y == 0 and y_pred == 0:
            cost -= x_amt * 1.1
        elif y == 0 and y_pred == 1:
            cost += x_amt
        elif y == 1 and y_pred == 0:
            cost += x_amt * 1.1
        elif y == 1 and y_pred == 1:
            cost += 0
        else:
            print("ERROR! No cost could be assigned to the instance: " + str(i))
    return cost

When I call this function with the three arrays after training, it computes the total cost produced by a model perfectly. However, integrating it into GridSearchCV is difficult, because the scoring function only receives two parameters. While it is possible to pass additional kwargs to the scorer, I have no idea how to pass a subset that depends on the split GridSearchCV is currently working on.
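To illustrate that kwargs limitation: extra keyword arguments given to make_scorer are indeed forwarded to the scoring function, but they are bound once at construction time, so the same full array arrives on every call regardless of which split GridSearchCV is evaluating. A minimal sketch (the names cost_score and amounts are made up for this example):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer

def cost_score(y_true, y_pred, amounts=None):
    # Sum the cost of every misclassified instance.
    wrong = y_true != y_pred
    return float(np.sum(amounts[wrong]))

X = np.zeros((4, 1))
y = np.array([0, 1, 0, 1])
amounts = np.array([1.0, 2.0, 3.0, 4.0])

clf = DummyClassifier(strategy="constant", constant=0).fit(X, y)

# `amounts` is frozen here; under cross-validation the rows of X are
# re-split each fold, so these positions no longer line up with y_true.
scorer = make_scorer(cost_score, greater_is_better=False, amounts=amounts)
print(scorer(clf, X, y))  # -6.0: the two mispredicted 1-labels cost 2.0 + 4.0
```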
What I have thought of / tried so far:
Wrapping the scoring function in a class that stores the cost of each instance in a pandas.Series object, indexed per instance. The cost of an instance could then, in theory, be looked up via the same index. Unfortunately, this does not work, because scikit-learn converts everything into numpy arrays.
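A minimal demonstration of that conversion, using scikit-learn's public check_array helper (which mirrors the validation run on inputs internally); the index of the Series does not survive:

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Per-instance costs held in a Series with a meaningful index.
costs = pd.Series([10.0, 20.0], index=["a", "b"])

# check_array returns a plain positional numpy array: the index is gone,
# so index-based lookups inside a scorer cannot work.
arr = check_array(costs, ensure_2d=False)
print(isinstance(arr, np.ndarray), hasattr(arr, "loc"))  # True False
```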
def calculate_costs_class(self, y_test, y_test_pred):
    cost = 0
    for index, _ in y_test.iteritems():
        y = y_test.loc[index]
        y_pred = y_test_pred.loc[index]
        x_amt = self.test_amt.loc[index]
        if y == 0 and y_pred == 0:
            cost += (x_amt * (-1)) + 5 + (x_amt * 0.1)  # -revenue, +shipping, +fees
        elif y == 0 and y_pred == 1:
            cost += x_amt  # +revenue
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5 + (x_amt * 0.1) + 5  # +revenue, +shipping, +fees, +chargeback cost
        elif y == 1 and y_pred == 1:
            cost += 0  # nothing
        else:
            print("ERROR! No cost could be assigned to the instance: " + str(index))
    return cost

Making the cost a feature in X, so that it is also available in the __call__ method of the _PredictScorer(_BaseScorer) class in scikit-learn's scorer.py. If I reprogrammed the call function to pass the cost array as a subset of X to score_func, I would also have the cost. Is there an "easier" solution?
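For what it is worth, the second idea above can be sketched without touching scikit-learn internals, because GridSearchCV also accepts a plain callable with the signature (estimator, X, y) as its scoring argument; that callable receives the test split of X, costs included. A sketch under the assumption that the cost is stored as the last column of X (the column layout and the toy data are made up; the cost arithmetic mirrors the calculate_costs function above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

COST_COL = -1  # hypothetical: cost stored as the last column of X

def cost_scoring(estimator, X, y):
    # GridSearchCV hands the scorer the test split of X, so the
    # matching cost values travel along with the features.
    amounts = X[:, COST_COL]
    y_pred = estimator.predict(X)
    cost = np.where((y == 1) & (y_pred == 0), amounts * 1.1,      # FN
           np.where((y == 0) & (y_pred == 1), amounts,            # FP
           np.where((y == 0) & (y_pred == 0), -amounts * 1.1,     # TN
                    0.0)))                                        # TP
    return -float(cost.sum())  # lower cost is better

rng = np.random.RandomState(0)
X = np.hstack([rng.rand(40, 3), rng.rand(40, 1) * 100])  # last col = cost
y = rng.randint(0, 2, 40)

gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {'criterion': ['gini', 'entropy']},
                  scoring=cost_scoring, cv=3)
gs.fit(X, y)
print(gs.best_params_)
```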
Posted on 2018-01-27 15:35:03
I found a way to solve this problem by following the second suggested approach: passing a PseudoInteger to scikit-learn that has all the same properties as a normal integer when compared or used in mathematical operations, but also acts as a wrapper around int and can store instance variables (such as the cost of the instance). As already stated in the question, this causes scikit-learn to recognize that the values in the passed label array are in fact of type object rather than int. So I just replaced the test in scikit-learn's type_of_target(y) method in line 273 to return 'binary' even though the array does not pass the test. Scikit-learn then simply treats the whole problem (as it should) as a binary classification problem. Lines 269-273 in the type_of_target(y) method in multiclass.py now look like:
    # Invalid inputs
    if y.ndim > 2 or (y.dtype == object and len(y) and
                      not isinstance(y.flat[0], string_types)):
        # return 'unknown'  # [[[1, 2]]] or [obj_1] and not ["label_1"]
        return 'binary'  # Sneaky, modified to force binary classification.

My code then looks like this:
import sklearn
import sklearn.model_selection
import sklearn.base
import sklearn.metrics
import numpy as np
import sklearn.tree
import sklearn.feature_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics.scorer import make_scorer
class PseudoInt(int):
    # Behaves like an integer, but is able to store instance variables
    pass
def grid_search(x, y_normal, x_amounts):
    # Change the label set to a np array containing pseudo ints with the costs associated with the instances
    y = np.empty(len(y_normal), dtype=PseudoInt)
    for index, value in y_normal.iteritems():  # assumes a default RangeIndex
        new_int = PseudoInt(value)
        new_int.cost = x_amounts.loc[index]  # Here the cost is added to the label
        y[index] = new_int

    # Normal train test split
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

    # Classifier
    clf = sklearn.tree.DecisionTreeClassifier()

    # Custom scorer with the cost function below (lower cost is better)
    cost_scorer = make_scorer(cost_function, greater_is_better=False)

    # Define pipeline
    pipe = Pipeline([('clf', clf)])

    # Grid search grid with any hyper parameters or other settings
    param_grid = [
        {'clf__criterion': ['gini', 'entropy']}  # prefix must match the pipeline step name
    ]

    # Grid search and pass the custom scorer function
    gs = GridSearchCV(estimator=pipe,
                      param_grid=param_grid,
                      scoring=cost_scorer,
                      n_jobs=1,
                      cv=5,
                      refit=True)

    # Run grid search and refit with best hyper parameters
    gs = gs.fit(x_train.values, y_train)
    print("Best Parameters: " + str(gs.best_params_))
    print('Best Accuracy: ' + str(gs.best_score_))

    # Predict with retrained model (with best parameters)
    y_test_pred = gs.predict(x_test.values)

    # Get scores (also cost score)
    get_scores(y_test, y_test_pred)
def get_scores(y_test, y_test_pred):
    print("Getting scores")

    print("SCORES")
    precision = sklearn.metrics.precision_score(y_test, y_test_pred)
    recall = sklearn.metrics.recall_score(y_test, y_test_pred)
    f1_score = sklearn.metrics.f1_score(y_test, y_test_pred)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_test_pred)
    print("Precision " + str(precision))
    print("Recall " + str(recall))
    print("Accuracy " + str(accuracy))
    print("F1_Score " + str(f1_score))

    print("COST")
    cost = cost_function(y_test, y_test_pred)
    print("Cost Savings " + str(-cost))

    print("CONFUSION MATRIX")
    cnf_matrix = sklearn.metrics.confusion_matrix(y_test, y_test_pred)
    cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
    print(cnf_matrix)
def cost_function(y_test, y_test_pred):
    """
    Calculates total cost based on TP, FP, TN, FN and the cost of a certain instance
    :param y_test: Has to be an array of PseudoInts containing the cost of each instance
    :param y_test_pred: Any array of PseudoInts or ints
    :return: Returns total cost
    """
    cost = 0
    for index in range(len(y_test)):
        y = y_test[index]
        y_pred = y_test_pred[index]
        x_amt = y.cost
        if y == 0 and y_pred == 0:
            cost -= x_amt  # Reducing cost by x_amt
        elif y == 0 and y_pred == 1:
            cost += x_amt  # Wrong classification adds cost
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5  # Wrong classification adds cost and fee
        elif y == 1 and y_pred == 1:
            cost += 0  # No cost
        else:
            raise ValueError("No cost could be assigned to the instance: " + str(index))
    return cost

Update
Instead of changing the file in the package directly (which is a bit dirty), I now add the following to the first import lines of my project:
import sklearn.utils.multiclass

def return_binary(y):
    return "binary"

sklearn.utils.multiclass.type_of_target = return_binary

This overrides the type_of_target(y) method in sklearn.utils.multiclass so that it always returns 'binary'. Note that this has to come before all the other scikit-learn imports.
https://stackoverflow.com/questions/48468115