
Q: PySpark model interpretation from a pipeline model

Stack Overflow user

Asked 2016-05-04 08:08:16

1 answer · 3.4K views · 0 followers · 2 votes

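I am implementing a DecisionTreeClassifier in PySpark using the Pipeline module, because I have several feature engineering steps to perform on my dataset. The code is similar to the example in the Spark documentation: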

from pyspark import SparkContext, SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

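# Create the SQLContext used below (the pyspark shell predefines sqlContext).
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Load the data stored in LIBSVM format as a DataFrame.
data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")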

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error.
# Note: metricName="precision" is accepted by Spark 1.x; Spark 2.0+ uses "accuracy" instead.
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="precision")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)

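The question is how to do model interpretation on this. The PipelineModel object has no method such as toDebugString(), which the model returned by DecisionTree.trainClassifier provides, and I cannot use DecisionTree.trainClassifier in my pipeline because trainClassifier() takes the training data as a parameter.

The pipeline, by contrast, takes the training data as the argument to fit() and the test data as the argument to transform().

Is there a way to use a Pipeline and still perform model interpretation and find attribute importance?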

decision-tree

apache-spark-mllib

apache-spark

pyspark

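1 Answer

Stack Overflow user

Answered 2016-07-08 05:22:15

Yes, I use the method below for nearly all of my model interpretation. The lines below use the naming conventions from your code excerpt.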

dtm = model.stages[-1]  # your estimator is the last stage in the pipeline,
# hence the DecisionTreeClassificationModel will be the last transformer in the PipelineModel object
print(dtm.explainParams())

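Now you have access to all the methods of the DecisionTreeClassificationModel. All available methods and attributes are listed in its API documentation. The code was not tested on your example.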
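For the attribute-importance part of the question, here is a minimal sketch of how the fitted tree could be pulled out of the PipelineModel and inspected. It assumes Spark 2.0 or later, where the PySpark DecisionTreeClassificationModel exposes featureImportances and toDebugString; older releases may not have them. The helper names find_stage, ranked_importances, and explain_tree are hypothetical and not part of PySpark.

```python
def find_stage(pipeline_model, attr_name):
    """Return the last fitted stage that exposes `attr_name`.

    Searching by attribute avoids hard-coding an index such as stages[-1],
    which would break if another stage were appended after the classifier.
    """
    for stage in reversed(pipeline_model.stages):
        if hasattr(stage, attr_name):
            return stage
    raise ValueError("no stage exposes %r" % attr_name)


def ranked_importances(tree_model, top_n=10):
    """Return (feature_index, importance) pairs, largest importance first."""
    importances = tree_model.featureImportances
    # PySpark returns an ml Vector here; toArray() converts it to a NumPy array.
    # Plain sequences are accepted too, which keeps the helper easy to test.
    values = importances.toArray() if hasattr(importances, "toArray") else importances
    pairs = [(i, float(v)) for i, v in enumerate(values) if v > 0]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:top_n]


def explain_tree(pipeline_model, top_n=10):
    """Print the learned tree and its most important feature indices."""
    tree = find_stage(pipeline_model, "featureImportances")
    # Assumption: toDebugString is a property (not a method) in pyspark.ml.
    print(tree.toDebugString)
    for index, score in ranked_importances(tree, top_n):
        print("feature %d: %.4f" % (index, score))
```

With the `model` fitted in the question, the call would be `explain_tree(model)`. The printed indices refer to positions in the indexedFeatures vector that the tree was trained on.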

3 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/37021964
