PySpark RandomForest feature importance: how to get column names from column indices

Stack Overflow user
Asked 2017-07-11 10:01:47
2 answers, 6.2K views, 0 followers, 10 votes

I am using a standard pipeline in Spark (StringIndexer + OneHotEncoder + RandomForestClassifier), as shown below:

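from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator

labelIndexer = StringIndexer(inputCol = class_label_name, outputCol="indexedLabel").fit(data)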

string_feature_indexers = [
   StringIndexer(inputCol=x, outputCol="int_{0}".format(x)).fit(data)
   for x in char_col_toUse_names
]

onehot_encoder = [
   OneHotEncoder(inputCol="int_"+x, outputCol="onehot_{0}".format(x))
   for x in char_col_toUse_names
]
all_columns = num_col_toUse_names + bool_col_toUse_names + ["onehot_"+x for x in char_col_toUse_names]
assembler = VectorAssembler(inputCols=[col for col in all_columns], outputCol="features")
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=100)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer] + string_feature_indexers + onehot_encoder + [assembler, rf, labelConverter])

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
cvModel = crossval.fit(trainingData)

After fitting, I can get the random forest and its feature importances with cvModel.bestModel.stages[-2].featureImportances, but this gives only feature indices, not feature/column names.

The result I get looks like this:

print(cvModel.bestModel.stages[-2].featureImportances)

(1446,[3,4,9,18,20,103,766,981,983,1098,1121,1134,1148,1227,1288,1345,1436,1444],[0.109898803421,0.0967396441648,4.24568235244e-05,0.0369705839109,0.0163489685127,3.2286694534e-06,0.0208192703688,0.0815822887175,0.0466903663708,0.0227619959989,0.0850922269211,0.000113388896956,0.0924779490403,0.163835022713,0.118987129392,0.107373548367,3.35577640585e-05,0.000229569946193])

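How can I map this back to column names, or to a column name + value format?

Essentially, I want the random forest feature importances together with their column names.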


2 Answers

Stack Overflow user

Answered 2017-07-11 16:17:29

Hey, why not map it back to the original columns with a list comprehension? Here is an example:

# in your case: trainingData.columns 
data_frame_columns = ["A", "B", "C", "D", "E", "F"]
# in your case: print(cvModel.bestModel.stages[-2].featureImportances)
# (vector size, indices of the non-zero entries, importance values)
feature_importance = (6, [1, 3, 5], [0.5, 0.5, 0.5])

# pair each non-zero index with its importance value
rf_output = [(data_frame_columns[i], value) for i, value in zip(feature_importance[1], feature_importance[2])]
dict(rf_output)

{'B': 0.5, 'D': 0.5, 'F': 0.5}
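
One caveat with the mapping above: the importance vector is indexed by slots of the assembled features vector (1446 of them in the question), not by the columns of trainingData, because each one-hot encoded column expands into several slots. A minimal sketch of an alternative follows, assuming the transformed DataFrame carries ml_attr metadata on its features column (VectorAssembler typically attaches it, but it may be missing in some pipelines or versions). The helper names slot_names_from_metadata and rank_importances and the toy metadata are hypothetical, and the Spark calls appear only as comments.

```python
def slot_names_from_metadata(features_metadata):
    """Return feature slot names ordered by slot index.

    Expects a dict shaped like the metadata of an assembled features column:
    {"ml_attr": {"attrs": {"numeric": [{"idx": 0, "name": ...}], "binary": [...]}}}
    """
    attrs = features_metadata["ml_attr"]["attrs"]
    # attribute groups ("numeric", "binary", "nominal") are flattened into one list
    slots = [attr for group in attrs.values() for attr in group]
    return [attr["name"] for attr in sorted(slots, key=lambda attr: attr["idx"])]


def rank_importances(names, indices, values):
    """Pair non-zero importances with slot names, largest first."""
    pairs = [(names[i], v) for i, v in zip(indices, values)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy metadata (hypothetical values) standing in for a real features column.
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}],
            "binary": [
                {"idx": 1, "name": "onehot_color_red"},
                {"idx": 2, "name": "onehot_color_blue"},
            ],
        },
        "num_attrs": 3,
    }
}

names = slot_names_from_metadata(example_metadata)
ranked = rank_importances(names, [0, 2], [0.25, 0.75])

# With the question's objects, the inputs would come from something like
# (assuming featureImportances is a SparseVector):
#   transformed = cvModel.bestModel.transform(trainingData)
#   meta = transformed.schema["features"].metadata
#   fi = cvModel.bestModel.stages[-2].featureImportances
#   ranked = rank_importances(slot_names_from_metadata(meta), fi.indices, fi.values)
```

Sorting by idx matters because the metadata groups attributes by type rather than by position in the vector.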
Votes: 1

Stack Overflow user

Answered 2018-01-30 15:28:17

I couldn't find any way to recover the real list of feature columns after the ML algorithm has run, so I am using this as a workaround for now.

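# cols_now is not defined in the original answer; presumably the list of
# columns fed to the VectorAssembler
print(len(cols_now))

FEATURE_COLS = []
for x in cols_now:
    if x[-6:] != "catVar":
        # plain column: occupies a single slot
        FEATURE_COLS += [x]
    else:
        # encoded column: list the original values, ordered by their index column (<col>_tmp)
        temp = trainingData.select([x[:-7], x[:-6] + "tmp"]).distinct().sort(x[:-6] + "tmp")
        temp_list = temp.select(x[:-7]).collect()
        FEATURE_COLS += [list(row)[0] for row in temp_list]

print(len(FEATURE_COLS))
print(FEATURE_COLS)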

I kept a consistent suffix naming scheme across all indexers (_tmp) and encoders (_catVar), like this:

column_vec_in = str_col
column_vec_out = [col + "_catVar" for col in str_col]

indexers = [StringIndexer(inputCol=x, outputCol=x + '_tmp')
            for x in column_vec_in]

encoders = [OneHotEncoder(dropLast=False, inputCol=x + "_tmp", outputCol=y)
            for x, y in zip(column_vec_in, column_vec_out)]

tmp = [[i, j] for i, j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]

This could be improved and generalized further, but for now this tedious workaround is what works best.
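
One possible generalization, sketched under assumptions rather than taken from the answer: instead of collecting distinct values from the DataFrame, the slot names can be built from the category labels that each fitted StringIndexerModel exposes through its labels attribute. The function name expand_feature_names and the "column=value" naming are hypothetical. The sketch also assumes that OneHotEncoder drops the last category when dropLast is left at its default, and it ignores options such as handleInvalid="keep" that add an extra category.

```python
def expand_feature_names(assembler_inputs, labels_by_column, drop_last=True):
    """Expand assembler input columns into one name per feature slot.

    assembler_inputs: column names in the order passed to the assembler.
    labels_by_column: maps an encoded column name to its category labels,
                      in index order; columns not in the map take one slot.
    drop_last:        mirror of the encoder's dropLast setting.
    """
    names = []
    for col in assembler_inputs:
        labels = labels_by_column.get(col)
        if labels is None:
            names.append(col)  # numeric or boolean column: a single slot
        else:
            kept = labels[:-1] if drop_last else labels
            names.extend("{0}={1}".format(col, label) for label in kept)
    return names


# Toy inputs (hypothetical values); with real models the labels would come
# from each fitted indexer, e.g. indexer_model.labels.
inputs = ["age", "onehot_color"]
labels = {"onehot_color": ["red", "blue", "green"]}

default_names = expand_feature_names(inputs, labels)
full_names = expand_feature_names(inputs, labels, drop_last=False)
```

The resulting list can then be indexed with the positions reported by featureImportances, which gives the column name + value format the question asks for.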

Votes: 0
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/45024192
