PySpark ChiSqSelector p-values and test statistics

Stack Overflow user
Asked 2018-06-21 15:11:54
1 answer, 3.1K views, 0 followers, 0 votes

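I am using PySpark's pyspark.ml.feature.ChiSqSelector to perform feature selection. apps is a column containing a sparse matrix that records whether a particular name (machine) has a particular application installed. In total, there are 21,615 applications that someone could have installed.

After fitting a ChiSqSelector object and transforming new data with it, I am confused about what selected_apps now represents. The documentation here is not much help. I have a few questions:

1) How can I get the chi-squared test statistics and p-values associated with the 21,615 input applications? Judging from dir(selector), they do not appear to be readily accessible.

2) Why are different applications shown in selected_apps? My hunch is that the machine in the second row below does not have applications 0, 1, 2, etc., so what is displayed for that row is the top 50 applications by p-value that it does have. This API seems to work quite differently from scikit-learn's SelectKBest(chi2), which simply returns the k most relevant features regardless of whether a particular machine has a "1" for that feature.

3) How can I override the default numTopFeatures=50 setting? This is mostly related to question 1 and to using only p-values for feature selection. There does not seem to be a numTopFeatures=-1 type of option for essentially "forgetting" this parameter.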
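On question 3, one possibility worth trying (a sketch, not a confirmed fix): ChiSqSelector has a selectorType parameter, and fpr is only consulted when selectorType='fpr'. With the default 'numTopFeatures', the top 50 features are kept regardless of the fpr value. The helper name fit_fpr_selector and its alpha argument are hypothetical, the column names are taken from this question, and a Spark version that provides selectorType (2.1 or later) is assumed.

```python
def fit_fpr_selector(df, alpha=0.05):
    """Fit a ChiSqSelector that keeps features by p-value threshold.

    Hypothetical helper: `df` is assumed to have the 'apps' and
    'multiple_event' columns described in the question.
    """
    # Imported lazily so the function can be defined without Spark installed.
    from pyspark.ml.feature import ChiSqSelector

    selector = ChiSqSelector(
        featuresCol='apps',
        outputCol='selected_apps',
        labelCol='multiple_event',
        selectorType='fpr',  # default is 'numTopFeatures', which ignores fpr
        fpr=alpha,
    )
    model = selector.fit(df)
    # Original indices (into the 21,615-wide vector) of the retained features.
    return model, list(model.selectedFeatures)
```

Calling `model, kept = fit_fpr_selector(df)` would then give the original indices of the retained features in `kept`, which may also help when interpreting selected_apps.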

>>> selector = ChiSqSelector(
...     featuresCol='apps',
...     outputCol='selected_apps',
...     labelCol='multiple_event',
...     fpr=0.05
... )
>>> result = selector.fit(df).transform(df)                                                                
>>> result.show()
+---------------+-----------+--------------+--------------------+--------------------+
|           name|total_event|multiple_event|                apps|       selected_apps|
+---------------+-----------+--------------+--------------------+--------------------+
|000000000000021|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000022|          0|             0|(21615,[3,6,7,8,9...|(50,[3,6,7,8,9,11...|
|000000000000023|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000024|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000025|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000026|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000027|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000028|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000029|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000030|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000031|          0|             0|(21615,[0,1,2,3,4...|(50,[0,1,2,3,4,6,...|
|000000000000032|          0|             0|(21615,[6,7,8,9,1...|(50,[6,7,8,9,13,1...|
|000000000000033|          0|             0|(21615,[0,1,2,3,4...|(50,[0,1,2,3,4,6,...|
|000000000000034|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000035|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000036|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000037|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000038|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000039|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000040|          0|             0|(21615,[0,1,2,3,4...|(50,[0,1,2,3,4,6,...|
+---------------+-----------+--------------+--------------------+--------------------+
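One way to read the selected_apps column, offered as an illustration rather than Spark's actual code: the selector keeps the same set of features for every row and renumbers them 0..k-1 in the output vector. Rows look different only because a sparse vector prints just its nonzero positions. The pure-Python sketch below mimics that renumbering; compress_sparse and the sample indices are hypothetical.

```python
def compress_sparse(indices, values, selected):
    """Keep only the selected original indices and renumber them 0..k-1."""
    # Map each selected original index to its position in the sorted selection.
    position = {orig: new for new, orig in enumerate(sorted(selected))}
    new_indices, new_values = [], []
    for idx, val in zip(indices, values):
        if idx in position:
            new_indices.append(position[idx])
            new_values.append(val)
    return len(position), new_indices, new_values


# Hypothetical selection of six original feature indices.
selected = [0, 1, 2, 3, 6, 7]

# Two made-up machines with different installed apps (nonzero positions).
machine_a = compress_sparse([0, 1, 2, 3, 6], [1.0] * 5, selected)
machine_b = compress_sparse([3, 6, 7, 8, 9], [1.0] * 5, selected)

print(machine_a)
print(machine_b)
```

Both outputs have the same size (the number of selected features); only the nonzero positions differ, and apps 8 and 9 drop out of machine_b because they were not selected.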

1 Answer

Stack Overflow user

Accepted answer

Posted 2018-06-22 01:04:55

I figured it out. The solution is as follows:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.stat import Statistics

# Convert everything to a LabeledPoint object, the main consumption
# data structure for most of mllib
to_labeled_point = lambda x: LabeledPoint(x[0], Vectors.dense(x[1].toArray()))

obs = (
    df
    .select('multiple_event', 'apps')
    .rdd
    .map(to_labeled_point)
)

# The contingency table is constructed from an RDD of LabeledPoint and used to conduct
# the independence test. Returns an array containing the ChiSquaredTestResult for every feature
# against the label.
feature_test_results = Statistics.chiSqTest(obs)

data = []

for idx, result in enumerate(feature_test_results):
    row = {
        'feature_index': idx,
        'p_value': result.pValue,
        'statistic': result.statistic,
        'degrees_of_freedom': result.degreesOfFreedom
    }
    data.append(row)
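
As a possible next step, the rows collected in data can be ranked or thresholded by p-value, which gives a p-value-only selection that does not depend on numTopFeatures. A minimal sketch, using made-up sample rows in the same shape as data (sample_rows, select_by_p_value and alpha are illustrative names, not part of Spark):

```python
# Hypothetical rows shaped like the `data` list built by the answer;
# the numbers are invented for illustration only.
sample_rows = [
    {'feature_index': 0, 'p_value': 0.30, 'statistic': 1.07, 'degrees_of_freedom': 1},
    {'feature_index': 1, 'p_value': 0.001, 'statistic': 10.83, 'degrees_of_freedom': 1},
    {'feature_index': 2, 'p_value': 0.04, 'statistic': 4.22, 'degrees_of_freedom': 1},
]


def select_by_p_value(rows, alpha=0.05):
    """Return feature indices with p-value below alpha, most significant first."""
    kept = [r for r in rows if r['p_value'] < alpha]
    kept.sort(key=lambda r: r['p_value'])
    return [r['feature_index'] for r in kept]


significant = select_by_p_value(sample_rows)
print(significant)
```

With the real data list in place of sample_rows, the same function would return the indices of the apps whose test against multiple_event falls below the chosen threshold.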
Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50971964
