
How to get an aggregated column inside a struct while selecting, in Spark?

Asked by a Stack Overflow user on 2021-08-17 08:04:31
1 answer · 119 views · 0 followers · 0 votes

When selecting, how can I get an aggregated column inside a struct?

Below is the input dataset:

+--------+------------------+---------------+---------------+-------------+------------------+-------+----------------------+---------------+---------------+
|batch_id|batch_run_id      |data_subject_cd|batch_frequency|target_object|Job_Run_Upd_Gmt_Ts|source |column_name           |column_datatype|table_row_count|
+--------+------------------+---------------+---------------+-------------+------------------+-------+----------------------+---------------+---------------+
|3033    |202101052030043000|employee       |D              |employee     |20210105213443    |source1|employee_id           |bigint         |100            |
|3033    |202101062030043000|employee       |D              |employee     |20210106211003    |source1|employee_name         |varchar        |100            |
|3033    |202101101830043000|employee       |D              |employee     |20210110185509    |source1|employee_department_id|bigint         |100            |
|3033    |202101034670052000|employee       |D              |employee     |20210430445670    |source1|employee_location_id  |bigint         |100            |
|3033    |202101065720891000|employee       |D              |employee     |20210670566381    |source1|employee_finance_id   |bigint         |100            |
|3043    |202101051830043000|department     |D              |department   |20210105185722    |source2|department_id         |bigint         |20             |
|3043    |202101061830053000|department     |D              |department   |20210106185742    |source2|department_name       |varchar        |20             |
|3043    |202103012130263000|location       |D              |location     |20210301223658    |source2|location_id           |bigint         |50             |
|3043    |202101182030083000|location       |D              |location     |20210118213841    |source2|location_name         |varchar        |50             |
|3050    |202102012030083000|finance        |D              |finance      |20210201212358    |source3|finance_id            |bigint         |10             |
|3050    |202102061830283000|finance        |D              |finance      |20210206185615    |source3|finance_name          |varchar        |10             |
+--------+------------------+---------------+---------------+-------------+------------------+-------+----------------------+---------------+---------------+

Below is the code I use to create the JSON:

query = """
SELECT DISTINCT GeneralInfo ,
                collect_list(struct(TableName, Columns, UsageInfo)) AS TABLES
FROM
  (SELECT (struct(batch_id AS datasetid, '' AS OWNER, batch_frequency AS `RefreshRate`, '' AS ZONE, '' AS Country, max(cast(Job_Run_Upd_Gmt_Ts AS int)) AS LastUpdate, '' AS Description, '' AS BusinessFunction, 'Data Harvesting' AS InfoSource, SOURCE)) AS GeneralInfo ,
          target_object AS TableName ,
          collect_list(struct(COLUMN_NAME, column_datatype)) AS Columns,
          struct(table_row_count AS NumberOfRows) AS UsageInfo
   FROM DATA
   GROUP BY batch_id,
            batch_run_id,
            batch_frequency,
            SOURCE,
            Job_Run_Upd_Gmt_Ts,
            target_object,
            table_row_count)
GROUP BY GeneralInfo
"""

x = spark.sql(query).toJSON()

x.collect()

But for some reason the LastUpdate column is missing from the JSON (I assume this is because of the aggregation I'm doing, since I want the maximum Job_Run_Upd_Gmt_Ts for each of these target_objects). LastUpdate should also be included as part of the GeneralInfo column in the JSON. Below is the JSON that gets created:

{
   "GeneralInfo":{
      "DataSetID":3033,
      "Contacts":"",
      "RefreshRate":"D",
      "Zone":"",
      "Country":"",
      "Description":"",
      "BusinessFunction":"",
      "InfoSource":"Data Harvesting",
      "source":"source1"
   },
   "Tables":[
      {
         "TableName":"employee",
         "Columns":[
            {
               "column_name":"employee_id",
               "column_datatype":"bigint"
            },
            {
               "column_name":"employee_name",
               "column_datatype":"varchar"
            },
            {
               "column_name":"employee_department_id",
               "column_datatype":"bigint"
            },
            {
               "column_name":"employee_location_id",
               "column_datatype":"bigint"
            },
            {
               "column_name":"employee_finance_id",
               "column_datatype":"bigint"
            }
         ],
         "UsageInfo":{
            "NumberOfRows":100
         }
      }
      
    ]
 }

Please help me figure out where I went wrong (I don't understand why I can't get the LastUpdate column in this JSON).


1 Answer

Stack Overflow user

Accepted answer

Answered on 2021-08-17 14:09:36

You are casting Job_Run_Upd_Gmt_Ts to int, but the values are outside the int range; you should cast to long instead.

Because the cast to int overflows (the largest int is 2147483647, while your 14-digit timestamps such as 20210105213443 are far larger), the cast returns null, and the max of nulls is null. to_json then omits null fields inside a struct, which is why LastUpdate disappears from the JSON.

spark.sql("select max(cast('20210105213443' as int)) as out ").show(false)

+----+
|out |
+----+
|null|
+----+

spark.sql("select to_json(struct(max(cast('20210105213443' as int)))) as out").show(false)

+---+
|out|
+---+
|{} |
+---+
Votes: 1
The original content of this page was provided by Stack Overflow; translation support was provided by Tencent Cloud's translation engine.
Original link:

https://stackoverflow.com/questions/68813747
