选择时,如何在struct中获得聚合列?
下面是输入数据集-
+--------+------------------+---------------+---------------+-------------+------------------+-------+----------------------+---------------+---------------+
|batch_id|batch_run_id |data_subject_cd|batch_frequency|target_object|Job_Run_Upd_Gmt_Ts|source |column_name |column_datatype|table_row_count|
+--------+------------------+---------------+---------------+-------------+------------------+-------+----------------------+---------------+---------------+
|3033 |202101052030043000|employee |D |employee |20210105213443 |source1|employee_id |bigint |100 |
|3033 |202101062030043000|employee |D |employee |20210106211003 |source1|employee_name |varchar |100 |
|3033 |202101101830043000|employee |D |employee |20210110185509 |source1|employee_department_id|bigint |100 |
|3033 |202101034670052000|employee |D |employee |20210430445670 |source1|employee_location_id |bigint |100 |
|3033 |202101065720891000|employee |D |employee |20210670566381 |source1|employee_finance_id |bigint |100 |
|3043 |202101051830043000|department |D |department |20210105185722 |source2|department_id |bigint |20 |
|3043 |202101061830053000|department |D |department |20210106185742 |source2|department_name |varchar |20 |
|3043 |202103012130263000|location |D |location |20210301223658 |source2|location_id |bigint |50 |
|3043 |202101182030083000|location |D |location |20210118213841 |source2|location_name |varchar |50 |
|3050 |202102012030083000|finance |D |finance |20210201212358 |source3|finance_id |bigint |10 |
|3050 |202102061830283000|finance |D |finance |20210206185615 |source3|finance_name |varchar |10 |
+--------+------------------+---------------+---------------+-------------+------------------+-------+----------------------+---------------+---------------+
下面是我用来创建JSON的代码-
query = """
SELECT DISTINCT GeneralInfo ,
collect_list(struct(TableName, Columns, UsageInfo)) AS TABLES
FROM
(SELECT (struct(batch_id AS datasetid, '' AS OWNER, batch_frequency AS `RefreshRate`, '' AS ZONE, '' AS Country, max(cast(Job_Run_Upd_Gmt_Ts AS int)) AS LastUpdate, '' AS Description, '' AS BusinessFunction, 'Data Harvesting' AS InfoSource, SOURCE)) AS GeneralInfo ,
target_object AS TableName ,
collect_list(struct(COLUMN_NAME, column_datatype)) AS Columns,
struct(table_row_count AS NumberOfRows) AS UsageInfo
FROM DATA
GROUP BY batch_id,
batch_run_id,
batch_frequency,
SOURCE,
Job_Run_Upd_Gmt_Ts,
target_object,
table_row_count)
GROUP BY GeneralInfo
"""
x = spark.sql(query).toJSON()
x.collect()
但是,由于某些原因,在JSON中缺少了LastUpdate列(假定这是由于我所做的聚合,原因是我希望所有这些target_object都具有最大的Job_Run_Upd_Gmt_Ts)。作为JSON中GeneralInfo列的一部分,LastUpdate也应该被追加。下面是创建的JSON -
{
"GeneralInfo":{
"DataSetID":3033,
"Contacts":"",
"RefreshRate":"D",
"Zone":"",
"Country":"",
"Description":"",
"BusinessFunction":"",
"InfoSource":"Data Harvesting",
"source":"source1"
},
"Tables":[
{
"TableName":"employee",
"Columns":[
{
"column_name":"employee_id",
"column_datatype":"bigint"
},
{
"column_name":"employee_name",
"column_datatype":"varchar"
},
{
"column_name":"employee_department_id",
"column_datatype":"bigint"
},
{
"column_name":"employee_location_id",
"column_datatype":"bigint"
},
{
"column_name":"employee_finance_id",
"column_datatype":"bigint"
}
],
"UsageInfo":{
"NumberOfRows":100
}
}
]
}
请帮我解决这个问题,我到底哪里出了问题(不知道为什么我不能在这个JSON中得到LastUpdate列)。
发布于 2021-08-17 06:09:36
您正在将Job_Run_Upd_Gmt_Ts
转换为int,但数据不在int范围内,您应该考虑将其转换为long。
由于您将列强制转换为int,这超出了int范围,因此您将获得null,而最大值null将为null。to_json
将忽略结构中的空。
spark.sql("select max(cast('20210105213443' as int)) as out ").show(false)
+----+
|out |
+----+
|null|
+----+
spark.sql("select to_json(struct(max(cast('20210105213443' as int)))) as out").show(false)
+---+
|out|
+---+
|{} |
+---+
https://stackoverflow.com/questions/68813747
复制相似问题