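I am currently working with pyspark and the Great Language Game dataset, which contains a number of samples stored as JSON objects like the one shown below.
Each sample represents one instance of the game: a person listened to an audio file of a spoken language and then had to pick, out of four possible languages, the one she had just heard.
I now want to group all of these games on the "target" and "guess" fields and then count the number of games for each ("target", "guess") pair. Can anyone help me with how to do this?
I have already looked at the pyspark documentation, but since I am quite new to python/pyspark, I do not really understand how the aggregate functions work.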
{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
 "choices": ["Hindi", "Lao", "Maltese", "Turkish"],
 "guess": "Maltese", "date": "2013-08-19", "country": "AU"}
Posted on 2019-05-17 20:44:12
Converting the JSON data into a pyspark DataFrame can be done like this.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import json

sc = SparkContext(conf=SparkConf())
sqlContext = SQLContext(sc)

def convert_single_object_per_line(json_list):
    # Serialize every object to its own line (one JSON object per line).
    json_string = ""
    for line in json_list:
        json_string += json.dumps(line) + "\n"
    return json_string

# A single sample game from the dataset.
json_list = [{"target": "Turkish", "sample": "af0e25c7637fb0dcdc56fac6d49aa55e",
 "choices": ["Hindi", "Lao", "Maltese", "Turkish"],
 "guess": "Maltese", "date": "2013-08-19", "country": "AU"}]
json_string = convert_single_object_per_line(json_list)
# Parse each line back into a dict and let Spark infer the schema.
df = sqlContext.createDataFrame([json.loads(line) for line in json_string.splitlines()])
[In]:df
[Out]:
DataFrame[choices: array<string>, country: string, date: string, guess: string, sample: string, target: string]
[In]:df.show()
[Out]:
+--------------------+-------+----------+-------+--------------------+-------+
|             choices|country|      date|  guess|              sample| target|
+--------------------+-------+----------+-------+--------------------+-------+
|[Hindi, Lao, Malt...|     AU|2013-08-19|Maltese|af0e25c7637fb0dcd...|Turkish|
+--------------------+-------+----------+-------+--------------------+-------+
https://stackoverflow.com/questions/56183633