I am having a hard time interpreting the data.table returned by xgboost's xgb.importance() function, and I would appreciate help understanding the meaning and intuition behind each of its columns.
To make things reproducible and concrete, I provide the following code:
library(data.table)
library(dplyr)
library(xgboost)
library(ISLR)
data(Auto)
Auto = Auto %>% mutate(
origin = ifelse(origin == 2, 1, 0)
)
Auto = Auto %>% select(-name)
library(caTools)
set.seed(1)  # sample.split is random; fix the seed so the split is reproducible
split = sample.split(Auto$origin, SplitRatio = 0.80)
train = subset(Auto, split == TRUE)
test = subset(Auto, split == FALSE)
X_train = as.matrix(train %>% select(-origin))
X_test = as.matrix(test %>% select(-origin))
Y_train = train$origin
Y_test = test$origin
positive = sum(Y_train == 1)
negative = sum(Y_train == 0)
Total = length(Y_train)
weight = ifelse(Y_train == 1, Total/positive, Total/negative)
# Pass the observation weights through xgb.DMatrix: xgb.train() has no
# `weight` argument, so a weight passed to it directly is silently ignored.
dtrain = xgb.DMatrix(data = X_train, label = Y_train, weight = weight)
dtest = xgb.DMatrix(data = X_test, label = Y_test)
model = xgb.train(data = dtrain,
                  verbose = 2,
                  params = list(objective = "binary:logistic"),
                  nrounds = 20)
y_pred = predict(model, X_test)
table(y_pred > 0.5, Y_test)
important_variables = xgb.importance(model = model, feature_names = colnames(X_train), data = X_train, label = Y_train)
important_variables
dim(important_variables)
The first rows of the important_variables data.table look like this:
Feature Split Gain Cover Frequency RealCover RealCover %
displacement 121.5 0.132621660 0.057075548 0.015075377 17 0.31481481
displacement 190.5 0.096984485 0.106824987 0.050251256 17 0.31481481
displacement 128 0.069083692 0.093517155 0.045226131 28 0.51851852
weight 2931.5 0.054731622 0.034017383 0.015075377 9 0.16666667
mpg 30.75 0.036373687 0.015353348 0.010050251 44 0.81481481
acceleration 19.8 0.030658707 0.043746304 0.015075377 50 0.92592593
displacement 169.5 0.028471073 0.035860862 0.020100503 20 0.37037037
displacement 113.5 0.028467685 0.017729564 0.020100503 27 0.50000000
horsepower 59 0.028450597 0.022879182 0.025125628 22 0.40740741
weight 2670.5 0.028335853 0.020309028 0.010050251 6 0.11111111
acceleration 15.6 0.022315984 0.026517622 0.015075377 51 0.94444444
weight 1947.5 0.020687204 0.003763738 0.005025126 7 0.12962963
acceleration 14.75 0.018458042 0.013565059 0.010050251 53 0.98148148
acceleration 19.65 0.018395565 0.006194124 0.010050251 53 0.98148148
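One arithmetic pattern can be checked directly against the printed table (a hedged observation read off the numbers, not taken from the documentation): in every row, RealCover divided by RealCover % gives the same constant, which suggests RealCover % is just RealCover normalized by a fixed total, plausibly the number of positive (origin == 1) labels in the training set:

```r
# Hedged check on the printed table: RealCover / `RealCover %` is constant
# across rows, hinting that RealCover % = RealCover / (some fixed count).
real_cover   <- c(17, 17, 28, 9, 44, 50, 20, 27, 22, 6, 51, 7, 53, 53)
real_cover_p <- c(0.31481481, 0.31481481, 0.51851852, 0.16666667, 0.81481481,
                  0.92592593, 0.37037037, 0.50000000, 0.40740741, 0.11111111,
                  0.94444444, 0.12962963, 0.98148148, 0.98148148)
totals <- real_cover / real_cover_p
print(unique(round(totals, 4)))  # a single value: 54
# 54 would then be the positive-label count, i.e. sum(Y_train == 1)
```

If that reading is right, RealCover would be the number of positive training observations covered by each split, and RealCover % the fraction of all positives it covers.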
According to the documentation, the columns are:
Feature: name of the feature, as provided in feature_names or already present in the model dump;
Gain: contribution of each feature to the model. For boosted tree models, each gain of each feature of each tree is taken into account, then averaged per feature to give a vision of the entire model. The highest percentage means the most important feature to predict the label used for the training (only available for tree models);
Cover: metric of the number of observations related to this feature (only available for tree models);
Weight: percentage representing the relative number of times a feature has been taken into the trees.
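One way to connect those definitions to the table is to look at the raw tree dump via xgb.model.dt.tree(), which lists every split with its un-normalized gain (the Quality column) and Cover; the importance table appears to be these quantities normalized so each column sums to 1. A hedged sketch, assuming the `model`, `X_train` and `important_variables` objects built above:

```r
# Hedged sketch: inspect the raw splits that the importance table summarizes.
library(xgboost)
library(data.table)

tree_dt <- xgb.model.dt.tree(feature_names = colnames(X_train), model = model)
splits  <- tree_dt[Feature != "Leaf"]   # leaf rows carry no split

# Un-normalized gain (Quality) and Cover per (feature, split point):
head(splits[, .(Feature, Split, Quality, Cover)])

# If Gain/Cover/Frequency in the importance table are normalized shares,
# each of those columns should sum to (approximately) 1:
important_variables[, .(sum(Gain), sum(Cover), sum(Frequency))]
```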
While Feature and Gain have fairly obvious meanings, the Cover, Frequency, RealCover and RealCover % columns are hard for me to interpret.
In the first row of the important_variables table we are told that displacement, split at 121.5, has Gain = 0.1326, Cover = 0.0571, Frequency = 0.0151, RealCover = 17 and RealCover % = 0.3148.
To decipher the meaning of these numbers, I ran the following code:
train %>% filter(displacement > 121.5) %>% summarize(Count = n(), Frequency = Count/nrow(train))
Count Frequency
190 0.6070288
train %>% filter(displacement > 121.5) %>% group_by(origin) %>% summarize(Count = n(), Frequency = Count/nrow(train))
origin Count Frequency
0 183 0.58466454
1 7 0.02236422
train %>% filter(displacement < 121.5) %>% summarize(Count = n(), Frequency = Count/nrow(train))
Count Frequency
123 0.3929712
train %>% filter(displacement < 121.5) %>% group_by(origin) %>% summarize(Count = n(), Frequency = Count/nrow(train))
origin Count Frequency
0 76 0.2428115
1 47 0.1501597
Nevertheless, I am still in the dark.
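One more pattern worth noting (a hedged guess, not something the documentation states): the smallest Frequency in the table, 0.005025126, is exactly 1/199, and the other Frequency values look like small integer multiples of it, which would make Frequency the share of all splits in the ensemble that use a given feature/split-point pair:

```r
# Hedged check: Frequency values in the table look like k/199 for integer k,
# i.e. (occurrences of this split) / (total number of splits, here 199).
print(round(1/199, 9))   # 0.005025126, the smallest Frequency shown
print(round(3/199, 9))   # 0.015075377, the most common Frequency shown

# Counting non-leaf rows of the tree dump should confirm the total of 199
# (assumes the `model` object built above):
# tree_dt <- xgb.model.dt.tree(colnames(X_train), model = model)
# nrow(tree_dt[Feature != "Leaf"])
```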
Any advice would be greatly appreciated.
https://stackoverflow.com/questions/47515352