I keep getting this error when using the idxmax() function in Pandas.
Traceback (most recent call last):
File "/Users/username/College/year-4/fyp-credit-card-fraud/code/main.py", line 20, in <module>
best_c_param = classify.print_kfold_scores(X_training_undersampled, y_training_undersampled)
File "/Users/username/College/year-4/fyp-credit-card-fraud/code/Classification.py", line 39, in print_kfold_scores
best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter']
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 1369, in idxmax
i = nanops.nanargmax(_values_from_object(self), skipna=skipna)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/nanops.py", line 74, in _f
raise TypeError(msg.format(name=f.__name__.replace('nan', '')))
TypeError: reduction operation 'argmax' not allowed for this dtype
I am using Pandas version 0.22.0.
main.py
import ExploratoryDataAnalysis as eda
import Preprocessing as processor
import Classification as classify
import pandas as pd
data_path = '/Users/username/college/year-4/fyp-credit-card-fraud/data/'
if __name__ == '__main__':
    df = pd.read_csv(data_path + 'creditcard.csv')
    # eda.init(df)
    # eda.check_null_values()
    # eda.view_data()
    # eda.check_target_classes()
    df = processor.noramlize(df)
    X_training, X_testing, y_training, y_testing, X_training_undersampled, X_testing_undersampled, \
        y_training_undersampled, y_testing_undersampled = processor.resample(df)
    best_c_param = classify.print_kfold_scores(X_training_undersampled, y_training_undersampled)
Classification.py
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, \
roc_auc_score, roc_curve, recall_score, classification_report
import pandas as pd
import numpy as np
def print_kfold_scores(X_training, y_training):
    print('\nKFold\n')
    fold = KFold(len(y_training), 5, shuffle=False)
    c_param_range = [0.01, 0.1, 1, 10, 100]
    results = pd.DataFrame(index=range(len(c_param_range), 2), columns=['C_parameter', 'Mean recall score'])
    results['C_parameter'] = c_param_range
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('\n-------------------------------------------')
        recall_accs = []
        for iteration, indices in enumerate(fold, start=1):
            lr = LogisticRegression(C=c_param, penalty='l1')
            lr.fit(X_training.iloc[indices[0], :], y_training.iloc[indices[0], :].values.ravel())
            y_prediction_undersampled = lr.predict(X_training.iloc[indices[1], :].values)
            recall_acc = recall_score(y_training.iloc[indices[1], :].values, y_prediction_undersampled)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)
        results.ix[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('\nMean recall score ', np.mean(recall_accs))
        print('\n')
    best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter'] # Error occurs on this line
    print('*****************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c_param)
    print('*****************************************************************')
    return best_c_param
The line of code causing the problem is:
best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter']
The output of the program is shown below:
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/username/College/year-4/fyp-credit-card-fraud/code/main.py
/Users/username/Library/Python/3.6/lib/python/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
Dataset Ratios
Percentage of genuine transactions: 0.5
Percentage of fraudulent transactions 0.5
Total number of transactions in resampled data: 984
Whole Dataset Split
Number of transactions in training dataset: 199364
Number of transactions in testing dataset: 85443
Total number of transactions in dataset: 284807
Undersampled Dataset Split
Number of transactions in training dataset 688
Number of transactions in testing dataset: 296
Total number of transactions in dataset: 984
KFold
-------------------------------------------
C parameter: 0.01
-------------------------------------------
Iteration 1 : recall score = 0.931506849315
Iteration 2 : recall score = 0.917808219178
Iteration 3 : recall score = 1.0
Iteration 4 : recall score = 0.959459459459
Iteration 5 : recall score = 0.954545454545
Mean recall score 0.9526639965
-------------------------------------------
C parameter: 0.1
-------------------------------------------
Iteration 1 : recall score = 0.849315068493
Iteration 2 : recall score = 0.86301369863
Iteration 3 : recall score = 0.915254237288
Iteration 4 : recall score = 0.945945945946
Iteration 5 : recall score = 0.909090909091
Mean recall score 0.89652397189
-------------------------------------------
C parameter: 1
-------------------------------------------
Iteration 1 : recall score = 0.86301369863
Iteration 2 : recall score = 0.86301369863
Iteration 3 : recall score = 0.983050847458
Iteration 4 : recall score = 0.945945945946
Iteration 5 : recall score = 0.924242424242
Mean recall score 0.915853322981
-------------------------------------------
C parameter: 10
-------------------------------------------
Iteration 1 : recall score = 0.849315068493
Iteration 2 : recall score = 0.876712328767
Iteration 3 : recall score = 0.983050847458
Iteration 4 : recall score = 0.945945945946
Iteration 5 : recall score = 0.939393939394
Mean recall score 0.918883626012
-------------------------------------------
C parameter: 100
-------------------------------------------
Iteration 1 : recall score = 0.86301369863
Iteration 2 : recall score = 0.876712328767
Iteration 3 : recall score = 0.983050847458
Iteration 4 : recall score = 0.945945945946
Iteration 5 : recall score = 0.924242424242
Mean recall score 0.918593049009
Traceback (most recent call last):
File "/Users/username/College/year-4/fyp-credit-card-fraud/code/main.py", line 20, in <module>
best_c_param = classify.print_kfold_scores(X_training_undersampled, y_training_undersampled)
File "/Users/username/College/year-4/fyp-credit-card-fraud/code/Classification.py", line 39, in print_kfold_scores
best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter']
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 1369, in idxmax
i = nanops.nanargmax(_values_from_object(self), skipna=skipna)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/nanops.py", line 74, in _f
raise TypeError(msg.format(name=f.__name__.replace('nan', '')))
TypeError: reduction operation 'argmax' not allowed for this dtype
Process finished with exit code 1
Posted on 2019-02-16 00:29:45
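By default, the cell values here are not of a numeric type. argmin(), idxmin(), argmax() and other similar functions require a numeric dtype.
The simplest solution is to use pd.to_numeric() to convert the Series (or column) to a numeric type. An example for a DataFrame df with a column 'a':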
df['a'] = pd.to_numeric(df['a'])
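To make the dtype change visible, here is a minimal sketch. The column 'a' and its values are made up for illustration, and the column is given dtype object explicitly to mimic the situation in the question. On the pandas version from the question (0.22.0), calling idxmax() on the object column is what raises the TypeError; other releases may behave differently, so the sketch only calls idxmax() after the conversion.

```python
import pandas as pd

# Hypothetical data: float values stored in an object-dtype column,
# similar to a column that was created empty and filled in cell by cell.
df = pd.DataFrame({'a': pd.Series([0.2, 0.9, 0.5], dtype=object)})
dtype_before = df['a'].dtype      # object

# Convert the column to a numeric dtype.
df['a'] = pd.to_numeric(df['a'])
dtype_after = df['a'].dtype       # float64

# idxmax() now works and returns the index label of the largest value.
best_row = df['a'].idxmax()
print(dtype_before, dtype_after, best_row)
```

Checking df.dtypes before calling a reduction is a quick way to spot this kind of problem.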
A more complete answer on type conversion in pandas can be found here.
Hope this helps :)
Posted on 2018-03-19 04:42:03
#best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
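This line of code should be replaced.
The main issues are:
1) The dtype of "Mean recall score" is object, so idxmax() cannot be used on it.
2) "Mean recall score" needs to be converted from object to float.
3) apply(pd.to_numeric, errors='coerce', axis=0) can be used to do the conversion.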
best_c = results_table  # alias of the same DataFrame, not a copy
best_c.dtypes.eq(object)  # shows which columns have dtype object
new = best_c.columns[best_c.dtypes.eq(object)]  # names of the object-dtype columns
best_c[new] = best_c[new].apply(pd.to_numeric, errors='coerce', axis=0)  # convert them to numeric; unparseable values become NaN
best_c  # display the converted table
best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']  # C value with the highest mean recall
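One thing worth knowing about errors='coerce' is that any value that cannot be parsed turns into NaN without raising an error, and idxmax() then skips it. The sketch below illustrates this on a made-up results_table; the 'n/a' entry is purely hypothetical and is only there to show the coercion.

```python
import pandas as pd

# Hypothetical results table; 'n/a' stands in for a value that cannot be parsed.
results_table = pd.DataFrame({
    'C_parameter': [0.01, 0.1, 1.0],
    'Mean recall score': pd.Series([0.95, 'n/a', 0.91], dtype=object),
})

# Select the object-dtype columns and convert them to numeric.
object_cols = results_table.columns[results_table.dtypes.eq(object)]
results_table[object_cols] = results_table[object_cols].apply(
    pd.to_numeric, errors='coerce')

# 'n/a' is now NaN; idxmax() ignores NaN by default (skipna=True).
best_row = results_table['Mean recall score'].idxmax()
best_c = results_table.loc[best_row, 'C_parameter']
print(results_table)
print(best_c)
```

If silently dropping bad values is not what you want, leaving errors at its default of 'raise' makes the conversion fail loudly.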
Posted on 2019-02-08 04:12:49
In short, try this
best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
instead of
best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
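Note that astype(float) returns a converted copy, so the column in results_table itself stays object. As a rough sketch of both that behaviour and one possible way to avoid the object dtype entirely, the example below builds the table from a list of rows in one go. The mean recall values are rounded from the output in the question, and the variable names rows and numeric_table are made up for this example.

```python
import pandas as pd

c_param_range = [0.01, 0.1, 1, 10, 100]
# Mean recall scores taken from the question's output, rounded to 4 places.
mean_recalls = [0.9527, 0.8965, 0.9159, 0.9189, 0.9186]

# Case 1: object-dtype column, cast only at lookup time.
results_table = pd.DataFrame({
    'C_parameter': c_param_range,
    'Mean recall score': pd.Series(mean_recalls, dtype=object),
})
scores = results_table['Mean recall score'].astype(float)  # converted copy
best_c = results_table.loc[scores.idxmax(), 'C_parameter']

# Case 2: collect one dict per C value and build the frame once,
# so pandas infers a float dtype and no cast is needed.
rows = [{'C_parameter': c, 'Mean recall score': r}
        for c, r in zip(c_param_range, mean_recalls)]
numeric_table = pd.DataFrame(rows)
best_c_numeric = numeric_table.loc[
    numeric_table['Mean recall score'].idxmax(), 'C_parameter']

print(best_c, best_c_numeric)
```

Both lookups pick the same C value; the second variant just keeps the dtype numeric from the start.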
https://stackoverflow.com/questions/48719937