How do I create a custom scoring function in scikit-learn that scores a set of instances based on each instance's individual attributes?

Stack Overflow user
Asked 2018-01-26 19:18:19
1 answer, 2K views, 0 followers, score 1

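I am trying to run GridSearchCV to optimize the hyperparameters of my classifier, and this should be done by optimizing a custom scoring function. The problem is that the scoring function depends on a cost that differs for each instance (the cost is also a feature of each instance). As the example below shows, it needs a further array, test_amt, that holds the cost of each instance, in addition to the y and y_pred that a "normal" scoring function receives.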

    def calculate_costs(y_test, y_test_pred, test_amt):
        cost = 0

        for i in range(len(y_test)):  # start at 0 so the first instance is counted too
            y = y_test.iloc[i]
            y_pred = y_test_pred.iloc[i]
            x_amt = test_amt.iloc[i]

            if y == 0 and y_pred == 0:
                cost -= x_amt * 1.1
            elif y == 0 and y_pred == 1:
                cost += x_amt
            elif y == 1 and y_pred == 0:
                cost += x_amt * 1.1
            elif y == 1 and y_pred == 1:
                cost += 0
            else:
                print("ERROR! No cost could be assigned to the instance: " + str(i))
        return cost

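When I call this function with these three arrays after training, it correctly computes the total cost a model incurs. Integrating it into GridSearchCV is difficult, however, because a scoring function takes only two parameters. Although extra kwargs can be passed to the scorer, I do not know how to pass a subset that depends on the split GridSearchCV is currently working on.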
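
The cost rule itself can be written without the explicit loop. A minimal sketch, assuming the three inputs are equal-length array-likes with 0/1 labels; the name calculate_costs_vectorized is hypothetical and NumPy is the only dependency:

```python
import numpy as np


def calculate_costs_vectorized(y_true, y_pred, amounts):
    """Total cost under the same rule as calculate_costs, without a loop."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    amounts = np.asarray(amounts, dtype=float)

    # Only the labels 0 and 1 have a defined cost.
    if not (np.isin(y_true, (0, 1)).all() and np.isin(y_pred, (0, 1)).all()):
        raise ValueError("Labels must be 0 or 1")

    # Per-instance multiplier applied to the instance's amount.
    factor = np.zeros(len(amounts))
    factor[(y_true == 0) & (y_pred == 0)] = -1.1  # cost -= x_amt * 1.1
    factor[(y_true == 0) & (y_pred == 1)] = 1.0   # cost += x_amt
    factor[(y_true == 1) & (y_pred == 0)] = 1.1   # cost += x_amt * 1.1
    # (1, 1) keeps factor 0: no cost

    return float(np.sum(factor * amounts))
```

This does not solve the integration problem by itself: the amounts still have to reach the function for the rows of the current split.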

What I have thought of / tried so far:

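  1. Wrapping the whole pipeline in a class that holds a globally stored pandas.Series object containing the cost of each instance, keyed by index. In theory, the cost of an instance could then be looked up through the same index. Unfortunately this does not work, because scikit-learn converts everything to a NumPy array.

        def calculate_costs_class(y_test, y_test_pred):
            cost = 0
            for index, _ in y_test.iteritems():
                y = y_test.loc[index]
                y_pred = y_test_pred.loc[index]
                x_amt = self.test_amt.loc[index]

                if y == 0 and y_pred == 0:
                    cost += (x_amt * (-1)) + 5 + (x_amt * 0.1)  # -revenue, +shipping, +fees
                elif y == 0 and y_pred == 1:
                    cost += x_amt  # +revenue
                elif y == 1 and y_pred == 0:
                    cost += x_amt + 5 + (x_amt * 0.1) + 5  # +revenue, +shipping, +fees, +chargeback costs
                elif y == 1 and y_pred == 1:
                    cost += 0  # nothing
                else:
                    print("ERROR! No cost could be assigned to the instance: " + str(index))
            return cost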
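  2. Creating a custom class PseudoInt as the data type of the labels. It inherits all properties of int but can additionally store the cost of the instance (while keeping all its properties for logical operations). This works outside scikit-learn, but the check_classification_targets method in scikit-learn raises ValueError: Unknown label type: 'unknown'.

        class PseudoInt(int):
            def __new__(cls, x, cost, *args, **kwargs):
                instance = int.__new__(cls, x, *args, **kwargs)
                instance.cost = cost
                return instance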
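  3. I have not tried this yet, but since the cost is also a feature in the instance set X, it is also available in the __call__ method of the _PredictScorer(_BaseScorer) class in scikit-learn's scorer.py. If I reprogrammed __call__ to also pass the cost array, as a subset of X, to score_func, I would have the cost there as well.
  4. Or: I could implement everything myself.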

Is there an "easier" solution?


1 Answer

Stack Overflow user

Accepted answer

Posted 2018-01-27 15:35:03

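I found a way to solve the problem by following the second idea from the question: passing scikit-learn a PseudoInt that behaves exactly like a normal integer when it is compared or used in arithmetic. It also acts as a wrapper around int, however, and can store instance variables (such as the cost of the instance). As already noted in the question, this makes scikit-learn recognize that the values in the passed label array are of type object rather than int. So I simply replaced the test at line 273 of scikit-learn's type_of_target(y) method so that it returns 'binary' even though the check is not passed. scikit-learn then treats the whole problem as a binary classification problem (as it should). Lines 269-273 of the type_of_target(y) method in multiclass.py now look as follows: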

# Invalid inputs
if y.ndim > 2 or (y.dtype == object and len(y) and
                  not isinstance(y.flat[0], string_types)):
    # return 'unknown'  # [[[1, 2]]] or [obj_1] and not ["label_1"]
    return 'binary' # Sneaky, modified to force binary classification.
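
Why this check is triggered can be seen in isolation. A minimal sketch, assuming only NumPy (the values are made up for illustration): an int subclass compares and adds like an int, accepts extra attributes, and ends up in an object-dtype array, which is the case the y.dtype == object test above catches.

```python
import numpy as np


class PseudoInt(int):
    """An int subclass; unlike plain int, its instances accept extra attributes."""
    pass


label = PseudoInt(1)
label.cost = 12.5  # works because instances of the subclass have a __dict__

# Comparisons and arithmetic behave like a plain int.
is_positive = (label == 1)
incremented = label + 1  # the result is a plain int, so the cost is not carried over

# NumPy stores such instances by reference in an object array.
labels = np.empty(2, dtype=object)
labels[0] = PseudoInt(0)
labels[0].cost = 3.0
labels[1] = label

# Index-based selection, as done for cross-validation splits, keeps the same objects.
subset = labels[[1]]
```

Because the array holds references, the cost attribute is still reachable on the selected labels.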

My code then looks as follows:

import sklearn
import sklearn.model_selection
import sklearn.base
import sklearn.metrics
import numpy as np
import sklearn.tree
import sklearn.feature_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer  # public import path; sklearn.metrics.scorer is a private module


class PseudoInt(int):
    # Behaves like an integer, but is able to store instance variables
    pass


def grid_search(x, y_normal, x_amounts):
    # Change the label set to a np array containing pseudo ints with the costs associated with the instances
    y = np.empty(len(y_normal), dtype=PseudoInt)
    for position, (index, value) in enumerate(y_normal.items()):
        new_int = PseudoInt(value)
        new_int.cost = x_amounts.loc[index]  # Here the cost is added to the label
        y[position] = new_int  # Write by position so a non-default index also works

    # Normal train test split
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

    # Classifier
    clf = sklearn.tree.DecisionTreeClassifier()

    # Custom scorer with the cost function below (lower cost is better)
    cost_scorer = make_scorer(cost_function, greater_is_better=False)  # Custom cost function (Lower cost is better)

    # Define pipeline
    pipe = Pipeline([('clf', clf)])

    # Grid search grid with any hyper parameters or other settings
    param_grid = [
        {'clf__criterion': ['gini', 'entropy']}  # 'clf' is the only step in the pipeline above
    ]

    # Grid search and pass the custom scorer function
    gs = GridSearchCV(estimator=pipe,
                      param_grid=param_grid,
                      scoring=cost_scorer,
                      n_jobs=1,
                      cv=5,
                      refit=True)

    # run grid search and refit with best hyper parameters
    gs = gs.fit(x_train.values, y_train)  # .values replaces the removed DataFrame.as_matrix()
    print("Best Parameters: " + str(gs.best_params_))
    print('Best Score (negated cost): ' + str(gs.best_score_))

    # Predict with retrained model (with best parameters)
    y_test_pred = gs.predict(x_test.values)

    # Get scores (also cost score)
    get_scores(y_test, y_test_pred)


def get_scores(y_test, y_test_pred):
    print("Getting scores")

    print("SCORES")
    precision = sklearn.metrics.precision_score(y_test, y_test_pred)
    recall = sklearn.metrics.recall_score(y_test, y_test_pred)
    f1_score = sklearn.metrics.f1_score(y_test, y_test_pred)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_test_pred)
    print("Precision      " + str(precision))
    print("Recall         " + str(recall))
    print("Accuracy       " + str(accuracy))
    print("F1_Score       " + str(f1_score))

    print("COST")
    cost = cost_function(y_test, y_test_pred)
    print("Cost Savings   " + str(-cost))

    print("CONFUSION MATRIX")
    cnf_matrix = sklearn.metrics.confusion_matrix(y_test, y_test_pred)
    cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
    print(cnf_matrix)


def cost_function(y_test, y_test_pred):
    """
    Calculates total cost based on TP, FP, TN, FN and the cost of a certain instance
    :param y_test: Has to be an array of PseudoInts containing the cost of each instance
    :param y_test_pred: Any array of PseudoInts or ints
    :return: Returns total cost
    """
    cost = 0

    for index in range(len(y_test)):
        # print(index)
        y = y_test[index]
        y_pred = y_test_pred[index]
        x_amt = y.cost

        if y == 0 and y_pred == 0:
            cost -= x_amt  # Reducing cost by x_amt
        elif y == 0 and y_pred == 1:
            cost += x_amt  # Wrong classification adds cost
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5 # Wrong classification adds cost and fee
        elif y == 1 and y_pred == 1:
            cost += 0  # No cost
        else:
            raise ValueError("No cost could be assigned to the instance: " + str(index))

    # print("Cost: " + str(cost))
    return cost

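Update

Instead of changing the file in the package directly (which is a bit dirty), I now add the following at the very top of my project, before the other imports: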

import sklearn.utils.multiclass

def return_binary(y):
    return "binary"

sklearn.utils.multiclass.type_of_target = return_binary

This overrides the type_of_target(y) method in sklearn.utils.multiclass so that it always returns 'binary'. Note that it has to come before all other sklearn imports.
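
Idea 3 from the question may also be reachable without patching scikit-learn: the scoring argument of GridSearchCV accepts a callable with the signature scorer(estimator, X, y), and that callable receives the rows of X that belong to the current split. A minimal sketch under stated assumptions (this is not the method used above): the amount is stored as a column of X, AMOUNT_COLUMN is a hypothetical position for it, and the cost rule is the one from cost_function.

```python
import numpy as np

AMOUNT_COLUMN = 0  # hypothetical: position of the amount/cost feature in X


def cost_scorer(estimator, X, y):
    """Scorer with the (estimator, X, y) signature; returns the negated total cost."""
    X = np.asarray(X)
    y = np.asarray(y)
    y_pred = np.asarray(estimator.predict(X))
    amounts = X[:, AMOUNT_COLUMN].astype(float)

    true_neg = (y == 0) & (y_pred == 0)
    false_pos = (y == 0) & (y_pred == 1)
    false_neg = (y == 1) & (y_pred == 0)

    cost = np.zeros(len(y))
    cost[true_neg] = -amounts[true_neg]        # reduces cost by the amount
    cost[false_pos] = amounts[false_pos]       # wrong classification adds cost
    cost[false_neg] = amounts[false_neg] + 5   # wrong classification adds cost and fee
    # (1, 1) stays at 0: no cost

    # Grid search maximizes the score, so a lower cost must map to a higher score.
    return float(-cost.sum())
```

It would be passed as GridSearchCV(pipe, param_grid, scoring=cost_scorer, cv=5). The labels can then stay plain integers, at the price of keeping the amount column in X.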

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/48468115
