Q: How to do a reduceByKey in PySpark with custom grouping of rows?

Stack Overflow user

Asked 2019-05-22 17:31:28

Answers: 2 | Views: 450 | Followers: 0 | Votes: 1

I have a DataFrame that looks like this:

items_df
======================================================
| customer   item_type    brand    price    quantity |  
|====================================================|
|  1         bread        reems     20         10    |  
|  2         butter       spencers  10         21    |  
|  3         jam          niles     10         22    |
|  1         bread        marks     16         18    |
|  1         butter       jims      19         12    |
|  1         jam          jills     16         6     |
|  2         bread        marks     16         18    |
======================================================

I created an RDD that converts the rows above into dictionaries:

rdd = items_df.rdd.map(lambda row: row.asDict())

The result looks like this:

[
   { "customer": 1, "item_type": "bread", "brand": "reems", "price": 20, "quantity": 10 },
   { "customer": 2, "item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21 },
   { "customer": 3, "item_type": "jam", "brand": "niles", "price": 10, "quantity": 22 },
   { "customer": 1, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 },
   { "customer": 1, "item_type": "butter", "brand": "jims", "price": 19, "quantity": 12 },
   { "customer": 1, "item_type": "jam", "brand": "jills", "price": 16, "quantity": 6 },
   { "customer": 2, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 }
]

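I want to first group the rows above by customer. Then I want to introduce the custom new keys "breads", "butters", and "jams", and collect all of that customer's rows under the matching key. That would reduce my RDD from 7 rows to 3 rows.

The output would look like this: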

[
    { 
        "customer": 1, 
        "breads": [
            {"item_type": "bread", "brand": "reems", "price": 20, "quantity": 10},
            {"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18},
        ],
        "butters": [
            {"item_type": "butter", "brand": "jims", "price": 19, "quantity": 12}
        ],
        "jams": [
            {"item_type": "jam", "brand": "jills", "price": 16, "quantity": 6}
        ]
    },
    {
        "customer": 2,
        "breads": [
            {"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18}
        ],
        "butters": [
            {"item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21}
        ],
        "jams": []
    },
    {
        "customer": 3,
        "breads": [],
        "butters": [],
        "jams": [
            {"item_type": "jam", "brand": "niles", "price": 10, "quantity": 22}
        ]
    }
]

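Does anyone know how to achieve this with PySpark? I would like to know whether there is a solution using reduceByKey() or something similar. If possible, I would like to avoid groupByKey().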
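One possible approach, sketched here under some assumptions rather than taken from an accepted answer: map each row to a `(customer, groups)` pair in which `groups` already has the three target keys, then let `reduceByKey()` concatenate the lists key by key. The names `GROUP_KEYS`, `to_grouped`, `merge_groups`, and `group_by_customer` are made up for illustration, and the mapping assumes that `item_type` only ever takes the three values shown in the sample data.

```python
# Assumption: every item_type is one of these three values; the plural
# names are the custom keys from the desired output.
GROUP_KEYS = {"bread": "breads", "butter": "butters", "jam": "jams"}


def to_grouped(row):
    """Turn one row dict into (customer, {"breads": [...], "butters": [...], "jams": [...]})."""
    # Keep every field except the grouping key itself.
    item = {k: v for k, v in row.items() if k != "customer"}
    # Start with an empty list per custom key so missing types show up as [].
    groups = {name: [] for name in GROUP_KEYS.values()}
    groups[GROUP_KEYS[row["item_type"]]].append(item)
    return row["customer"], groups


def merge_groups(a, b):
    """Associative merge for reduceByKey: concatenate the lists key by key."""
    return {name: a[name] + b[name] for name in GROUP_KEYS.values()}


def group_by_customer(rdd):
    """Expects the RDD of row dicts built with items_df.rdd.map(lambda row: row.asDict())."""
    return (
        rdd.map(to_grouped)
        .reduceByKey(merge_groups)
        # Flatten (customer, groups) back into a single dict per customer.
        .map(lambda kv: {"customer": kv[0], **kv[1]})
    )
```

With the RDD from the question, `group_by_customer(rdd).collect()` should yield one dict per customer. The order of customers, and of items inside each list, is not guaranteed. Note also that because the merge only concatenates lists, every row still has to be shuffled, so this is unlikely to move much less data than `groupByKey()` would; the gain from `reduceByKey()` is largest when the merge actually shrinks the values.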

apache-spark

pyspark

apache-spark-sql

rdd

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/56253622
