问如何在struct中获取聚合列，选择spark？
EN

Stack Overflow用户

提问于 2021-08-17 00:04:31

回答 1查看 119关注 0票数 0

选择时，如何在struct中获得聚合列？

下面是输入数据集-

+--------+------------------+---------------+---------------+-------------+------------------+-------+----------------------+---------------+---------------+
|batch_id|batch_run_id      |data_subject_cd|batch_frequency|target_object|Job_Run_Upd_Gmt_Ts|source |column_name           |column_datatype|table_row_count|
+--------+------------------+---------------+---------------+-------------+------------------+-------+----------------------+---------------+---------------+
|3033    |202101052030043000|employee       |D              |employee     |20210105213443    |source1|employee_id           |bigint         |100            |
|3033    |202101062030043000|employee       |D              |employee     |20210106211003    |source1|employee_name         |varchar        |100            |
|3033    |202101101830043000|employee       |D              |employee     |20210110185509    |source1|employee_department_id|bigint         |100            |
|3033    |202101034670052000|employee       |D              |employee     |20210430445670    |source1|employee_location_id  |bigint         |100            |
|3033    |202101065720891000|employee       |D              |employee     |20210670566381    |source1|employee_finance_id   |bigint         |100            |
|3043    |202101051830043000|department     |D              |department   |20210105185722    |source2|department_id         |bigint         |20             |
|3043    |202101061830053000|department     |D              |department   |20210106185742    |source2|department_name       |varchar        |20             |
|3043    |202103012130263000|location       |D              |location     |20210301223658    |source2|location_id           |bigint         |50             |
|3043    |202101182030083000|location       |D              |location     |20210118213841    |source2|location_name         |varchar        |50             |
|3050    |202102012030083000|finance        |D              |finance      |20210201212358    |source3|finance_id            |bigint         |10             |
|3050    |202102061830283000|finance        |D              |finance      |20210206185615    |source3|finance_name          |varchar        |10             |
+--------+------------------+---------------+---------------+-------------+------------------+-------+----------------------+---------------+---------------+

下面是我用来创建JSON的代码-

query = """
SELECT DISTINCT GeneralInfo ,
                collect_list(struct(TableName, Columns, UsageInfo)) AS TABLES
FROM
  (SELECT (struct(batch_id AS datasetid, '' AS OWNER, batch_frequency AS `RefreshRate`, '' AS ZONE, '' AS Country, max(cast(Job_Run_Upd_Gmt_Ts AS int)) AS LastUpdate, '' AS Description, '' AS BusinessFunction, 'Data Harvesting' AS InfoSource, SOURCE)) AS GeneralInfo ,
          target_object AS TableName ,
          collect_list(struct(COLUMN_NAME, column_datatype)) AS Columns,
          struct(table_row_count AS NumberOfRows) AS UsageInfo
   FROM DATA
   GROUP BY batch_id,
            batch_run_id,
            batch_frequency,
            SOURCE,
            Job_Run_Upd_Gmt_Ts,
            target_object,
            table_row_count)
GROUP BY GeneralInfo
"""

x = spark.sql(query).toJSON()

x.collect()

但是，由于某些原因，在JSON中缺少了LastUpdate列(假定这是由于我所做的聚合，原因是我希望所有这些target_object都具有最大的Job_Run_Upd_Gmt_Ts)。作为JSON中GeneralInfo列的一部分，LastUpdate也应该被追加。下面是创建的JSON -

{
   "GeneralInfo":{
      "DataSetID":3033,
      "Contacts":"",
      "RefreshRate":"D",
      "Zone":"",
      "Country":"",
      "Description":"",
      "BusinessFunction":"",
      "InfoSource":"Data Harvesting",
      "source":"source1"
   },
   "Tables":[
      {
         "TableName":"employee",
         "Columns":[
            {
               "column_name":"employee_id",
               "column_datatype":"bigint"
            },
            {
               "column_name":"employee_name",
               "column_datatype":"varchar"
            },
            {
               "column_name":"employee_department_id",
               "column_datatype":"bigint"
            },
            {
               "column_name":"employee_location_id",
               "column_datatype":"bigint"
            },
            {
               "column_name":"employee_finance_id",
               "column_datatype":"bigint"
            }
         ],
         "UsageInfo":{
            "NumberOfRows":100
         }
      }
      
    ]
 }

请帮我解决这个问题，我到底哪里出了问题(不知道为什么我不能在这个JSON中得到LastUpdate列)。

apache-spark

pyspark

apache-spark-sql

回答 1

Stack Overflow用户

回答已采纳

发布于 2021-08-17 06:09:36

您正在将Job_Run_Upd_Gmt_Ts转换为int，但数据不在int范围内，您应该考虑将其转换为long。

由于您将列强制转换为int，这超出了int范围，因此您将获得null，而最大值null将为null。to_json将忽略结构中的空。

spark.sql("select max(cast('20210105213443' as int)) as out ").show(false)

+----+
|out |
+----+
|null|
+----+

spark.sql("select to_json(struct(max(cast('20210105213443' as int)))) as out").show(false)

+---+
|out|
+---+
|{} |
+---+