I am trying to run GridSearchCV to tune the hyperparameters of a classifier, and the scoring should be done with a custom scoring function. The problem is that the score is based on a cost that differs per instance (the cost is also one of the instance's features). As the example below shows, the scoring function therefore needs a third array, test_amt, holding the cost of each instance, in addition to the "normal" y and y_pred arguments.
def calculate_costs(y_test, y_test_pred, test_amt):
    cost = 0
    for i in range(len(y_test)):  # start at 0 so the first instance is not skipped
        y = y_test.iloc[i]
        y_pred = y_test_pred.iloc[i]
        x_amt = test_amt.iloc[i]
        if y == 0 and y_pred == 0:
            cost -= x_amt * 1.1
        elif y == 0 and y_pred == 1:
            cost += x_amt
        elif y == 1 and y_pred == 0:
            cost += x_amt * 1.1
        elif y == 1 and y_pred == 1:
            cost += 0
        else:
            print("ERROR! No cost could be assigned to the instance: " + str(i))
    return cost

When I call this function with the three arrays after training, it computes the total cost produced by a model perfectly. However, integrating it into GridSearchCV is difficult, because the scoring function only receives two parameters. While it is possible to pass additional kwargs to the scorer, I have no idea how to pass a subset that depends on the split GridSearchCV is currently working on.
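To illustrate that kwargs limitation: extra keyword arguments given to make_scorer are indeed forwarded to the scoring function, but they are bound once at construction time, so the same full array arrives on every call regardless of which split GridSearchCV is evaluating. A minimal sketch (the names cost_score and amounts are made up for this example):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer

def cost_score(y_true, y_pred, amounts=None):
    # Sum the cost of every misclassified instance.
    wrong = y_true != y_pred
    return float(np.sum(amounts[wrong]))

X = np.zeros((4, 1))
y = np.array([0, 1, 0, 1])
amounts = np.array([1.0, 2.0, 3.0, 4.0])

clf = DummyClassifier(strategy="constant", constant=0).fit(X, y)

# `amounts` is frozen here; under cross-validation the rows of X are
# re-split each fold, so these positions no longer line up with y_true.
scorer = make_scorer(cost_score, greater_is_better=False, amounts=amounts)
print(scorer(clf, X, y))  # -6.0: the two mispredicted 1-labels cost 2.0 + 4.0
```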
What I have thought of / tried so far:
Wrapping the scoring function in a class that stores the cost of each instance in a pandas.Series object, indexed per instance. The cost of an instance could then, in theory, be looked up via the same index. Unfortunately, this does not work, because scikit-learn converts everything into numpy arrays.
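A minimal demonstration of that conversion, using scikit-learn's public check_array helper (which mirrors the validation run on inputs internally); the index of the Series does not survive:

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Per-instance costs held in a Series with a meaningful index.
costs = pd.Series([10.0, 20.0], index=["a", "b"])

# check_array returns a plain positional numpy array: the index is gone,
# so index-based lookups inside a scorer cannot work.
arr = check_array(costs, ensure_2d=False)
print(isinstance(arr, np.ndarray), hasattr(arr, "loc"))  # True False
```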
def calculate_costs_class(self, y_test, y_test_pred):
    cost = 0
    for index, _ in y_test.iteritems():
        y = y_test.loc[index]
        y_pred = y_test_pred.loc[index]
        x_amt = self.test_amt.loc[index]
        if y == 0 and y_pred == 0:
            cost += (x_amt * (-1)) + 5 + (x_amt * 0.1)  # -revenue, +shipping, +fees
        elif y == 0 and y_pred == 1:
            cost += x_amt  # +revenue
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5 + (x_amt * 0.1) + 5  # +revenue, +shipping, +fees, +chargeback cost
        elif y == 1 and y_pred == 1:
            cost += 0  # nothing
        else:
            print("ERROR! No cost could be assigned to the instance: " + str(index))
    return cost

Making the cost a feature in X, so that it is also available in the __call__ method of the _PredictScorer(_BaseScorer) class in scikit-learn's scorer.py. If I reprogrammed the call function to pass the cost array as a subset of X to score_func, I would also have the cost. Is there an "easier" solution?
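For what it is worth, the second idea above can be sketched without touching scikit-learn internals, because GridSearchCV also accepts a plain callable with the signature (estimator, X, y) as its scoring argument; that callable receives the test split of X, costs included. A sketch under the assumption that the cost is stored as the last column of X (the column layout and the toy data are made up; the cost arithmetic mirrors the calculate_costs function above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

COST_COL = -1  # hypothetical: cost stored as the last column of X

def cost_scoring(estimator, X, y):
    # GridSearchCV hands the scorer the test split of X, so the
    # matching cost values travel along with the features.
    amounts = X[:, COST_COL]
    y_pred = estimator.predict(X)
    cost = np.where((y == 1) & (y_pred == 0), amounts * 1.1,      # FN
           np.where((y == 0) & (y_pred == 1), amounts,            # FP
           np.where((y == 0) & (y_pred == 0), -amounts * 1.1,     # TN
                    0.0)))                                        # TP
    return -float(cost.sum())  # lower cost is better

rng = np.random.RandomState(0)
X = np.hstack([rng.rand(40, 3), rng.rand(40, 1) * 100])  # last col = cost
y = rng.randint(0, 2, 40)

gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {'criterion': ['gini', 'entropy']},
                  scoring=cost_scoring, cv=3)
gs.fit(X, y)
print(gs.best_params_)
```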
Posted on 2018-01-27 15:35:03
I found a way to solve this problem by following the second suggested approach: passing a PseudoInteger to scikit-learn that has all the same properties as a normal integer when compared or used in mathematical operations, but also acts as a wrapper around int and can store instance variables (such as the cost of the instance). As already stated in the question, this causes scikit-learn to recognize that the values in the passed label array are in fact of type object rather than int. So I just replaced the test in scikit-learn's type_of_target(y) method in line 273 to return 'binary' even though the array does not pass the test. Scikit-learn then simply treats the whole problem (as it should) as a binary classification problem. Lines 269-273 in the type_of_target(y) method in multiclass.py now look like:
    # Invalid inputs
    if y.ndim > 2 or (y.dtype == object and len(y) and
                      not isinstance(y.flat[0], string_types)):
        # return 'unknown'  # [[[1, 2]]] or [obj_1] and not ["label_1"]
        return 'binary'  # Sneaky, modified to force binary classification.

My code then looks like this:
import sklearn
import sklearn.model_selection
import sklearn.base
import sklearn.metrics
import numpy as np
import sklearn.tree
import sklearn.feature_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics.scorer import make_scorer
class PseudoInt(int):
    # Behaves like an integer, but is able to store instance variables
    pass
def grid_search(x, y_normal, x_amounts):
    # Change the label set to a np array containing pseudo ints with the costs associated with the instances
    y = np.empty(len(y_normal), dtype=PseudoInt)
    for index, value in y_normal.iteritems():  # assumes a default RangeIndex
        new_int = PseudoInt(value)
        new_int.cost = x_amounts.loc[index]  # Here the cost is added to the label
        y[index] = new_int

    # Normal train test split
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

    # Classifier
    clf = sklearn.tree.DecisionTreeClassifier()

    # Custom scorer with the cost function below (lower cost is better)
    cost_scorer = make_scorer(cost_function, greater_is_better=False)

    # Define pipeline
    pipe = Pipeline([('clf', clf)])

    # Grid search grid with any hyper parameters or other settings
    param_grid = [
        {'clf__criterion': ['gini', 'entropy']}  # prefix must match the pipeline step name
    ]

    # Grid search and pass the custom scorer function
    gs = GridSearchCV(estimator=pipe,
                      param_grid=param_grid,
                      scoring=cost_scorer,
                      n_jobs=1,
                      cv=5,
                      refit=True)

    # Run grid search and refit with best hyper parameters
    gs = gs.fit(x_train.values, y_train)
    print("Best Parameters: " + str(gs.best_params_))
    print('Best Accuracy: ' + str(gs.best_score_))

    # Predict with retrained model (with best parameters)
    y_test_pred = gs.predict(x_test.values)

    # Get scores (also cost score)
    get_scores(y_test, y_test_pred)
def get_scores(y_test, y_test_pred):
    print("Getting scores")

    print("SCORES")
    precision = sklearn.metrics.precision_score(y_test, y_test_pred)
    recall = sklearn.metrics.recall_score(y_test, y_test_pred)
    f1_score = sklearn.metrics.f1_score(y_test, y_test_pred)
    accuracy = sklearn.metrics.accuracy_score(y_test, y_test_pred)
    print("Precision " + str(precision))
    print("Recall " + str(recall))
    print("Accuracy " + str(accuracy))
    print("F1_Score " + str(f1_score))

    print("COST")
    cost = cost_function(y_test, y_test_pred)
    print("Cost Savings " + str(-cost))

    print("CONFUSION MATRIX")
    cnf_matrix = sklearn.metrics.confusion_matrix(y_test, y_test_pred)
    cnf_matrix = cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
    print(cnf_matrix)
def cost_function(y_test, y_test_pred):
    """
    Calculates total cost based on TP, FP, TN, FN and the cost of a certain instance
    :param y_test: Has to be an array of PseudoInts containing the cost of each instance
    :param y_test_pred: Any array of PseudoInts or ints
    :return: Returns total cost
    """
    cost = 0
    for index in range(len(y_test)):
        y = y_test[index]
        y_pred = y_test_pred[index]
        x_amt = y.cost
        if y == 0 and y_pred == 0:
            cost -= x_amt  # Reducing cost by x_amt
        elif y == 0 and y_pred == 1:
            cost += x_amt  # Wrong classification adds cost
        elif y == 1 and y_pred == 0:
            cost += x_amt + 5  # Wrong classification adds cost and fee
        elif y == 1 and y_pred == 1:
            cost += 0  # No cost
        else:
            raise ValueError("No cost could be assigned to the instance: " + str(index))
    return cost

Update
Instead of changing the file in the package directly (which is a bit dirty), I now add the following to the first import lines of my project:
import sklearn.utils.multiclass

def return_binary(y):
    return "binary"

sklearn.utils.multiclass.type_of_target = return_binary

This overrides the type_of_target(y) method in sklearn.utils.multiclass so that it always returns 'binary'. Note that this has to come before all the other scikit-learn imports.
https://stackoverflow.com/questions/48468115