# the example of flatMap (reconstructed; lambda x: x flattens one level of nesting)
flat_rdd_test = rdd_test.flatMap(lambda x: x)
print("flatMap\n", flat_rdd_test.collect())
You will find that one level of tuple nesting has been removed compared with the original data; the output is:
[(10,1,2,3), (10,1,2,4), (10,1,2,4), (20,2,2,2), (20,1,2,3)]
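As a cross-check, the flattening can be reproduced without a Spark session. This is a minimal plain-Python sketch; the exact nesting of the input is an assumption reconstructed from the printed output above.

```python
# Plain-Python sketch of RDD.flatMap semantics; no Spark session needed.
# The nesting of the input data is an assumption reconstructed from
# the printed output above.
nested = [((10, 1, 2, 3), (10, 1, 2, 4)),
          ((10, 1, 2, 4), (20, 2, 2, 2), (20, 1, 2, 3))]

# flatMap(lambda x: x) flattens exactly one level of nesting
flat = [t for group in nested for t in group]
print(flat)
# [(10, 1, 2, 3), (10, 1, 2, 4), (10, 1, 2, 4), (20, 2, 2, 2), (20, 1, 2, 3)]
```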
3.filter()
It generally filters according to a boolean expression supplied in the parentheses, keeping only the elements for which the expression evaluates to true.
pyspark.RDD.filter
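Before the RDD example, the semantics can be sketched in plain Python. The data is the flatMap result from above; the `x[0] == 10` / `x[0] == 20` predicates are assumptions inferred from the printed outputs in this section.

```python
# Plain-Python sketch of RDD.filter semantics: keep only the elements
# for which the predicate returns True. The predicates here are
# assumptions inferred from the printed outputs below.
flat = [(10, 1, 2, 3), (10, 1, 2, 4), (10, 1, 2, 4),
        (20, 2, 2, 2), (20, 1, 2, 3)]

key1 = [x for x in flat if x[0] == 10]  # like rdd.filter(lambda x: x[0] == 10)
key2 = [x for x in flat if x[0] == 20]

print(key1)  # [(10, 1, 2, 3), (10, 1, 2, 4), (10, 1, 2, 4)]
print(key2)  # [(20, 2, 2, 2), (20, 1, 2, 3)]
```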
# the example of filter
key1_rdd = flat_rdd_test.filter(lambda x: x[0] == 10)  # predicates reconstructed from the outputs below
key2_rdd = flat_rdd_test.filter(lambda x: x[0] == 20)
print("filter_1\n", key1_rdd.collect())
print("filter_2\n", key2_rdd.collect())
The output is:
filter_1
[(10,1,2,3), (10,1,2,4), (10,1,2,4)]
filter_2
[(20,2,2,2), (20,1,2,3)]

4.groupBy()
groupBy() groups the elements of the RDD according to the grouping key returned by the function supplied in the parentheses.
pyspark.RDD.groupBy
# the example of groupBy
# We can first define a named function
def return_group_key(x):
    seq = x[1:]
    # classification rule reconstructed from the output below:
    # sum of the remaining fields <= 6 -> 'small', otherwise 'big'
    if sum(seq) <= 6:
        return 'small'
    else:
        return 'big'

group_rdd_1 = flat_rdd_test.groupBy(lambda x: return_group_key(x))
# mapValues(list) turns each group's ResultIterable into a readable list
print("groupBy_1\n", group_rdd_1.mapValues(list).collect())
The readable output (after mapValues(list)) is:
[('small', [(10,1,2,3), (20,2,2,2), (20,1,2,3)]), ('big', [(10,1,2,4), (10,1,2,4)])]
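The same grouping can be mirrored in plain Python to see what groupBy does bucket by bucket. This is a sketch under the assumption, reconstructed from the output above, that the rule is "sum of the fields after the first is at most 6":

```python
from collections import defaultdict

# Plain-Python sketch of RDD.groupBy: every element goes into the bucket
# named by the key function. The sum(x[1:]) <= 6 threshold is an
# assumption reconstructed from the printed output above.
def return_group_key(x):
    seq = x[1:]
    return 'small' if sum(seq) <= 6 else 'big'

flat = [(10, 1, 2, 3), (10, 1, 2, 4), (10, 1, 2, 4),
        (20, 2, 2, 2), (20, 1, 2, 3)]

groups = defaultdict(list)
for x in flat:
    groups[return_group_key(x)].append(x)

print(list(groups.items()))
# [('small', [(10, 1, 2, 3), (20, 2, 2, 2), (20, 1, 2, 3)]),
#  ('big', [(10, 1, 2, 4), (10, 1, 2, 4)])]
```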