Q: Why do I get "ValueError: feature_names mismatch" when specifying a list of feature names for visualization in XGBoost?

Stack Overflow user

Asked 2018-06-06 10:19:49

Answers: 1 · Views: 5.1K · Followers: 0 · Votes: 3

I get this error when I pass feature names while defining the data matrices in XGBoost's internal data structure (DMatrix):

d_train = xgboost.DMatrix(X_train, label=y_train, feature_names=list(X))
d_test = xgboost.DMatrix(X_test, label=y_test, feature_names=list(X))
...
...
...
shap_values = shap.TreeExplainer(model).shap_values(X_train)
shap.summary_plot(shap_values, X_train)

ValueError                                Traceback (most recent call last)
<ipython-input-59-4635c450279d> in <module>()
----> 1 shap_values = shap.TreeExplainer(model).shap_values(X_train)
      2 shap.summary_plot(shap_values, X_train)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\shap\explainers\tree.py in shap_values(self, X, **kwargs)
    104             if not str(type(X)).endswith("xgboost.core.DMatrix'>"):
    105                 X = xgboost.DMatrix(X)
--> 106             phi = self.trees.predict(X, pred_contribs=True)
    107         elif self.model_type == "lightgbm":
    108             phi = self.trees.predict(X, pred_contrib=True)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs)
   1042             option_mask |= 0x08
   1043 
-> 1044         self._validate_features(data)
   1045 
   1046         length = c_bst_ulong()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\core.py in _validate_features(self, data)
   1286 
   1287                 raise ValueError(msg.format(self.feature_names,
-> 1288                                             data.feature_names))
   1289 
   1290     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['Serial No', 'gender', 'Date', 'Product_Type', 'Product_Type', ... ... , 'Last_feature'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39']
<names of some features at column number corresponding to feature number in the following list> in input data
training data did not have the following fields: f7, f31, f33, f11, f6, f26, f2, f5, f17, f4, f37, f9, f1, f0, f39, f14, f12, f23, f13, f15, f22, f19, f35, f24, f38, f8, f28, f25, f20, f34, f27, f32, f36, f29, f16, f3, f21, f18, f30, f10

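When I do not specify feature names while defining the DMatrix, I get no error and the plot is produced.

But I need the plot to show the actual feature names rather than Feature 2, Feature 15, and so on. Why does this error occur, and how can I fix it?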

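The full code is below in case you need it. I am essentially trying to reproduce the visualization from a linked example, but with my own dataset and my own model training parameters: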

from sklearn.model_selection import train_test_split
import xgboost
import shap
import xlrd
import numpy as np
import matplotlib.pylab as pl

# print the JS visualization code to the notebook
shap.initjs()

import pandas as pd
data = pd.read_csv('InputCEM_FS_out.csv')
X = data.loc[:, data.columns != 'Score'] 
y = data['Score']
y = y/max(y)

# create a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)


# Some columns are float or integer and some are object, so the object columns need to be encoded:
from sklearn import preprocessing 
for f in X_train.columns: 
    if X_train[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder() 
        lbl.fit(list(X_train[f].values)) 
        X_train[f] = lbl.transform(list(X_train[f].values))

for f in X_test.columns: 
    if X_test[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder() 
        lbl.fit(list(X_test[f].values)) 
        X_test[f] = lbl.transform(list(X_test[f].values))

X_train.fillna((-999), inplace=True) 
X_test.fillna((-999), inplace=True)

X_train=np.array(X_train) 
X_test=np.array(X_test) 
X_train = X_train.astype(float) 
X_test = X_test.astype(float)

d_train = xgboost.DMatrix(X_train, label=y_train, feature_names=list(X)) # This gives the error later on. Remove the "feature_names=list(X)" part to not get the error.
d_test = xgboost.DMatrix(X_test, label=y_test, feature_names=list(X)) # This gives the error later on. Remove the "feature_names=list(X)" part to not get the error.

params = [
    ('max_depth', 3),
    ('eta', 0.025),
    ('objective', 'binary:logistic'),
    ('min_child_weight', 4),
    ('silent', 1),
    ('eval_metric', 'auc'),
    ('subsample', 0.75),
    ('colsample_bytree', 0.75),
    ('gamma', 0.75),
]

model = xgboost.train(params, d_train, 5000, evals = [(d_test, "test")], verbose_eval=100, early_stopping_rounds=20)

shap_values = shap.TreeExplainer(model).shap_values(X_train) # This line is what gives the error if the feature names are specified
shap.summary_plot(shap_values, X_train)

Tags: python, matplotlib, plot, visualization, xgboost

1 Answer

Stack Overflow user

Accepted answer

Posted 2018-06-06 21:08:41

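As the error message shows, the data passed for prediction has columns with the default names f0, f1, ... (listed in the error as f7, f31, ...), while the DMatrix the model was trained on keeps the real column names. The cause appears to be this line: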

shap_values = shap.TreeExplainer(model).shap_values(X_train)

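It passes X_train, which at this point is just a numpy array with no column names. The traceback shows shap wrapping it in xgboost.DMatrix(X), so the columns get the default names f0, f1, and so on. Instead, try passing a DataFrame that carries the desired column names: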

shap_values = shap.TreeExplainer(model).shap_values(pd.DataFrame(X_train, columns=X.columns))
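To make the mechanism concrete, here is a minimal pure-Python sketch of the kind of name comparison that the traceback points to. It is a simplified stand-in, not XGBoost's actual code: the helpers `default_feature_names` and `validate_features` are hypothetical names introduced only for illustration, and the three column names are taken from the error message above.

```python
# Simplified stand-in for the feature-name check seen in the traceback.
# Both helpers are hypothetical; they are not XGBoost's real implementation.

def default_feature_names(n_columns):
    """Fallback names for unnamed columns, following the f0, f1, ... pattern in the error."""
    return [f"f{i}" for i in range(n_columns)]


def validate_features(trained_names, data_names):
    """Raise ValueError when the names seen at prediction differ from those used in training."""
    if list(trained_names) != list(data_names):
        raise ValueError(
            f"feature_names mismatch: {list(trained_names)} {list(data_names)}"
        )


# Names given at training time via feature_names=...
trained = ["Serial No", "gender", "Date"]

# A bare numpy array carries no names, so it falls back to f0, f1, f2.
unnamed = default_feature_names(len(trained))

try:
    validate_features(trained, unnamed)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(err)

# Supplying the same names again (for example through a DataFrame) passes the check.
validate_features(trained, trained)
```

The point of the sketch is only that the names must agree on both sides: either name the columns at training and at prediction time, or leave them unnamed in both places.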

Votes: 5

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/50711382
