
Interpreting the data.table returned by xgboost's xgb.importance() function in R

Asked by a Stack Overflow user on 2017-11-28 00:17:09
1 answer · 1.2K views · 2 votes

I am having trouble interpreting the data.table returned by xgboost's xgb.importance() function, and I would appreciate help understanding the meaning and intuition behind each of its columns.

To make things reproducible and concrete, here is the code I used:

library(data.table)
library(dplyr)
library(xgboost)
library(ISLR)
library(caTools)

data(Auto)

# Recode origin as a binary label: 1 for European cars (origin == 2), 0 otherwise,
# and drop the name column.
Auto = Auto %>% mutate(origin = ifelse(origin == 2, 1, 0))
Auto = Auto %>% select(-name)

# 80/20 train/test split, stratified on the label.
split = sample.split(Auto$origin, SplitRatio = 0.80)
train = subset(Auto, split == TRUE)
test  = subset(Auto, split == FALSE)

X_train = as.matrix(train %>% select(-origin))
X_test  = as.matrix(test %>% select(-origin))
Y_train = train$origin
Y_test  = test$origin

# Per-observation weights to compensate for class imbalance.
positive = sum(Y_train == 1)
negative = sum(Y_train == 0)
Total    = length(Y_train)
weight   = ifelse(Y_train == 1, Total/positive, Total/negative)

# Weights are a property of the DMatrix, so they are attached there
# (xgb.train itself has no weight argument).
dtrain = xgb.DMatrix(data = X_train, label = Y_train, weight = weight)
dtest  = xgb.DMatrix(data = X_test,  label = Y_test)

model = xgb.train(data = dtrain,
                  verbose = 2,
                  params = list(objective = "binary:logistic"),
                  nrounds = 20)

y_pred = predict(model, X_test)
table(y_pred > 0.5, Y_test)

# Importance table; passing data and label as well (older xgboost API) yields
# one row per split and adds the RealCover columns.
important_variables = xgb.importance(model = model,
                                     feature_names = colnames(X_train),
                                     data = X_train,
                                     label = Y_train)

important_variables
dim(important_variables)

The first few rows of the important_variables data.table are shown below:

Feature       Split   Gain        Cover       Frequency   RealCover  RealCover %
displacement  121.5   0.132621660 0.057075548 0.015075377 17         0.31481481
displacement  190.5   0.096984485 0.106824987 0.050251256 17         0.31481481
displacement  128     0.069083692 0.093517155 0.045226131 28         0.51851852
weight        2931.5  0.054731622 0.034017383 0.015075377  9         0.16666667
mpg           30.75   0.036373687 0.015353348 0.010050251 44         0.81481481
acceleration  19.8    0.030658707 0.043746304 0.015075377 50         0.92592593
displacement  169.5   0.028471073 0.035860862 0.020100503 20         0.37037037
displacement  113.5   0.028467685 0.017729564 0.020100503 27         0.50000000
horsepower    59      0.028450597 0.022879182 0.025125628 22         0.40740741
weight        2670.5  0.028335853 0.020309028 0.010050251  6         0.11111111
acceleration  15.6    0.022315984 0.026517622 0.015075377 51         0.94444444
weight        1947.5  0.020687204 0.003763738 0.005025126  7         0.12962963
acceleration  14.75   0.018458042 0.013565059 0.010050251 53         0.98148148
acceleration  19.65   0.018395565 0.006194124 0.010050251 53         0.98148148
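
One quick check on this table (a sketch reusing the important_variables object from above; whether these columns are normalized shares is an assumption to be verified, not something stated in the output) is to sum the Gain, Cover and Frequency columns:

# If Gain, Cover and Frequency are shares of the whole model rather than
# absolute values, each of these column sums should come out as 1.
important_variables[, lapply(.SD, sum), .SDcols = c("Gain", "Cover", "Frequency")]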

According to the documentation:

The columns are:

  • Feature: the name of the feature, as provided in feature_names or as already present in the model dump;

  • Gain: the contribution of each feature to the model. For boosted tree models, each gain of each feature in each tree is taken into account, then averaged per feature to give a view of the entire model. A higher percentage means the feature is more important for predicting the label used in training (only available for tree models);

  • Cover: a metric of the number of observations related to this feature (only available for tree models);

  • Weight: a percentage representing the relative number of times a feature has been used in the trees.
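
The per-node numbers behind these definitions can be inspected with xgb.model.dt.tree() from the xgboost package, which returns one row per tree node. The following is a sketch, assuming the xgboost version contemporary with this question (where the node gain column is named Quality); the idea that the Gain column is each feature's share of the summed node gains is an assumption written here as a check, not a documented fact.

# Dump the fitted trees: one row per node, with the splitting Feature, the
# Split point, the node's gain (column "Quality") and its Cover.
tree_dt <- xgb.model.dt.tree(feature_names = colnames(X_train), model = model)

# Internal nodes only; leaf rows have Feature == "Leaf" and no split.
split_nodes <- tree_dt[Feature != "Leaf"]

# Share of total gain per feature -- to be compared against the Gain column.
gain_by_feature <- split_nodes[, .(TotalGain = sum(Quality)), by = Feature]
gain_by_feature[, GainShare := TotalGain / sum(TotalGain)]
gain_by_feature[order(-GainShare)]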

Although Feature and Gain have obvious meanings, the Cover, Frequency, RealCover and RealCover % columns are hard for me to interpret.

In the first row of the important_variables table we are told that displacement has:

  • Split = 121.5
  • Gain = 0.133
  • Cover = 0.057
  • Frequency = 0.015
  • RealCover = 17
  • RealCover % = 0.31
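
One way to cross-check the values listed above (a sketch reusing objects defined earlier; treating the number of positive training labels as the denominator of RealCover % is an assumption stated here as a check, not a fact):

# Print the first row together with its column names, so each value is
# unambiguously matched to its column.
important_variables[1, ]

# RealCover % for this row is 0.3148..., which looks like RealCover divided by
# the number of positive-class rows in the training set (assumption to verify).
n_pos <- sum(Y_train == 1)
n_pos
important_variables[1, RealCover] / n_pos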

To decipher what these numbers mean, I ran the following code:

train %>% filter(displacement > 121.5) %>% summarize(Count = n(), Frequency = Count/nrow(train))

  Count  Frequency
    190  0.6070288

train %>% filter(displacement > 121.5) %>% group_by(origin) %>% summarize(Count = n(), Frequency = Count/nrow(train))

  origin  Count   Frequency
       0    183  0.58466454
       1      7  0.02236422

train %>% filter(displacement < 121.5) %>% summarize(Count = n(), Frequency = Count/nrow(train))

  Count  Frequency
    123  0.3929712

train %>% filter(displacement < 121.5) %>% group_by(origin) %>% summarize(Count = n(), Frequency = Count/nrow(train))

  origin  Count  Frequency
       0     76  0.2428115
       1     47  0.1501597
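
For comparison, calling xgb.importance() with only the model (no data or label) returns the aggregated per-feature view, with one row per feature instead of one row per split. A sketch, reusing the model object defined above:

# Aggregated (per-feature) importance table.
agg_importance <- xgb.importance(feature_names = colnames(X_train), model = model)
agg_importance

# Optional: bar chart of the same table.
xgb.plot.importance(agg_importance)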

Nevertheless, I am still none the wiser.

Your advice would be greatly appreciated.

Original question:

https://stackoverflow.com/questions/47515352
