I am having a hard time interpreting the data.table returned by xgboost's xgb.importance() function, and I would appreciate help understanding the meaning and intuition behind its columns.
To make things reproducible and concrete, I provide the following code:
library(data.table)
library(dplyr)
library(xgboost)
library(ISLR)
data(Auto)
Auto = Auto %>% mutate(
  origin = ifelse(origin == 2, 1, 0)
)
Auto = Auto %>% select(-name)
library(caTools)
split = sample.split(Auto$origin, SplitRatio = 0.80)
train = subset(Auto, split == TRUE)
test = subset(Auto, split == FALSE)
X_train = as.matrix(train %>% select(-origin))
X_test = as.matrix(test %>% select(-origin))
Y_train = train$origin
Y_test = test$origin
positive = sum(Y_train == 1)
negative = sum(Y_train == 0)
Total = length(Y_train)
weight = ifelse(Y_train == 1, Total/positive, Total/negative)
dtrain = xgb.DMatrix(data = X_train, label = Y_train )
dtest = xgb.DMatrix(data = X_test, label = Y_test)
model = xgb.train(data = dtrain,
                  verbose = 2,
                  params = list(objective = "binary:logistic"),
                  weight = weight,
                  nrounds = 20)
y_pred = predict(model, X_test)
table(y_pred > 0.5, Y_test)
important_variables = xgb.importance(model = model, feature_names = colnames(X_train), data = X_train, label = Y_train)
important_variables
dim(important_variables)
The first few rows of the important_variables data.table are:
Feature Split Gain Cover Frequency RealCover RealCover %
displacement 121.5 0.132621660 0.057075548 0.015075377 17 0.31481481
displacement 190.5 0.096984485 0.106824987 0.050251256 17 0.31481481
displacement 128 0.069083692 0.093517155 0.045226131 28 0.51851852
weight 2931.5 0.054731622 0.034017383 0.015075377 9 0.16666667
mpg 30.75 0.036373687 0.015353348 0.010050251 44 0.81481481
acceleration 19.8 0.030658707 0.043746304 0.015075377 50 0.92592593
displacement 169.5 0.028471073 0.035860862 0.020100503 20 0.37037037
displacement 113.5 0.028467685 0.017729564 0.020100503 27 0.50000000
horsepower 59 0.028450597 0.022879182 0.025125628 22 0.40740741
weight 2670.5 0.028335853 0.020309028 0.010050251 6 0.11111111
acceleration 15.6 0.022315984 0.026517622 0.015075377 51 0.94444444
weight 1947.5 0.020687204 0.003763738 0.005025126 7 0.12962963
acceleration 14.75 0.018458042 0.013565059 0.010050251 53 0.98148148
acceleration 19.65 0.018395565 0.006194124 0.010050251 53 0.98148148
According to the documentation:
The columns include:
Feature: name of the feature, as provided in feature_names or already present in the model dump;
Gain: contribution of each feature to the model. For boosted tree models, each gain of each feature of each tree is taken into account, then averaged per feature to give a vision of the whole model. The highest percentage means the most important feature to predict the label used for the training (only available for tree models);
Cover: metric of the number of observations related to this feature (only available for tree models);
Weight: percentage representing the relative number of times a feature has been taken into trees.
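Note that the documentation describes the per-feature summary, whereas the table above has one row per split because data and label were passed to xgb.importance(). As a minimal sketch (assuming the model trained in the code above), the per-feature view can be obtained by calling xgb.importance() with the model alone, and summing the per-split rows within a feature should give roughly the same numbers:
# One row per feature; Gain, Cover and Frequency are each normalised to sum to 1
xgb.importance(feature_names = colnames(X_train), model = model)
# Summing the per-split rows of important_variables within each feature
# should reproduce (approximately) those per-feature fractions
important_variables[, .(Gain = sum(Gain),
                        Cover = sum(Cover),
                        Frequency = sum(Frequency)), by = Feature][order(-Gain)]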
While Feature and Gain have obvious meanings, the Cover, Frequency, RealCover and RealCover % columns are hard for me to interpret.
In the first row of the important_variables table we are told that the split of displacement at 121.5 has Gain = 0.1326, Cover = 0.0571, Frequency = 0.0151, RealCover = 17 and RealCover % = 0.3148.
To decipher what these numbers mean, I ran the following code:
train %>% filter(displacement > 121.5) %>% summarize(Count = n(), Frequency = Count/nrow(train))
Count Frequency
190 0.6070288
train %>% filter(displacement > 121.5) %>% group_by(origin) %>% summarize(Count = n(), Frequency = Count/nrow(train))
origin Count Frequency
0 183 0.58466454
1 7 0.02236422
train %>% filter(displacement < 121.5) %>% summarize(Count = n(), Frequency = Count/nrow(train))
Count Frequency
123 0.3929712
train %>% filter(displacement < 121.5) %>% group_by(origin) %>% summarize(Count = n(), Frequency = Count/nrow(train))
origin Count Frequency
0 76 0.2428115
1 47 0.1501597
Despite this, I am still in the dark.
Your advice would be greatly appreciated.
Posted on 2017-11-28 00:36:24
Frequency is the percentage of splits involving a particular feature, relative to all the splits made. You can sanity-check this by observing that the Frequency values of all features sum to 1.
sum(important_variables$Frequency)
[1] 1
It shows how many times a feature was selected to split on. Although not as sophisticated as Gain, it can also be used as a variable-importance measure.
This also explains why you cannot get the same Frequency numbers by running summary operations on the training data: Frequency is computed from the trained xgboost model, not from the data.
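As a minimal sketch of that point (assuming the model from the question), Frequency can be recovered by counting split nodes in the per-tree dump produced by xgb.model.dt.tree(), rather than by counting rows of the training data:
tree_dt = xgb.model.dt.tree(feature_names = colnames(X_train), model = model)
splits = tree_dt[Feature != "Leaf"]   # keep split nodes, drop leaf nodes
n_splits = nrow(splits)               # total number of splits over all trees
# Share of splits made on each (feature, threshold) pair -- these fractions
# sum to 1 and should match the Frequency column of the per-split table
splits[, .(Frequency = .N / n_splits), by = .(Feature, Split)][order(-Frequency)]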
Cover and the columns derived from it are less straightforward. For a detailed explanation, see the answers to this question:
https://stackoverflow.com/questions/47515352
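A rough sketch of what that answer describes, using the tree dump again (assuming the same model as above): the Cover of a node is the sum, over the training observations routed to that node, of the second derivative (hessian) of the loss, which for binary:logistic is p(1 - p) with p the predicted probability at that boosting round.
tree_dt = xgb.model.dt.tree(feature_names = colnames(X_train), model = model)
# Per-node Cover straight from the dump, e.g. the root node of the first tree:
tree_dt[Tree == 0 & Node == 0, .(Feature, Split, Cover)]
# The Cover column of xgb.importance() should then be these node covers summed
# over the splits made on each feature and normalised to sum to 1.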