如何从给定的开始日期和结束日期时间段针对每个月的值生成新行?比方说,我有一个开始日期列和结束日期列,在两个日期之间有8个月。如何使用具有月份值的新列针对8个月生成8行。假如一月是一月,那么一月是一月,二是二月等到八?
我尝试使用尝试过的rows和array_repeat,这帮助我为每一行生成针对month_between()的行。但这不是我想要的结果。
发布于 2022-10-30 15:04:42
有一个sequence函数,它将使用start、end和step创建数组(如列表理解)。您可以explode该数组来创建行。
如果在某些情况下开始日期和结束日期不匹配,则需要在explode之前调整结果数组。
下面是一个例子
data_sdf. \
withColumn('mth_arr', func.expr('sequence(start_dt, end_dt, interval 1 month)')). \
withColumn('mth_arr_new',
func.when(func.month(func.element_at('mth_arr', -1)) < func.month('end_dt'),
func.flatten(func.array('mth_arr', func.array('end_dt')))
).
when(func.element_at('mth_arr', -1) < func.col('end_dt'),
func.flatten(func.array(func.expr('slice(mth_arr, 1, size(mth_arr)-1)'), func.array('end_dt')))
).
otherwise(func.col('mth_arr'))
). \
selectExpr('start_dt', 'end_dt', 'explode(mth_arr_new) as mths'). \
show(100, truncate=False)
# +----------+----------+----------+
# |start_dt |end_dt |mths |
# +----------+----------+----------+
# |2020-01-01|2020-10-01|2020-01-01|
# |2020-01-01|2020-10-01|2020-02-01|
# |2020-01-01|2020-10-01|2020-03-01|
# |2020-01-01|2020-10-01|2020-04-01|
# |2020-01-01|2020-10-01|2020-05-01|
# |2020-01-01|2020-10-01|2020-06-01|
# |2020-01-01|2020-10-01|2020-07-01|
# |2020-01-01|2020-10-01|2020-08-01|
# |2020-01-01|2020-10-01|2020-09-01|
# |2020-01-01|2020-10-01|2020-10-01|
# |2020-01-10|2020-10-11|2020-01-10|
# |2020-01-10|2020-10-11|2020-02-10|
# |2020-01-10|2020-10-11|2020-03-10|
# |2020-01-10|2020-10-11|2020-04-10|
# |2020-01-10|2020-10-11|2020-05-10|
# |2020-01-10|2020-10-11|2020-06-10|
# |2020-01-10|2020-10-11|2020-07-10|
# |2020-01-10|2020-10-11|2020-08-10|
# |2020-01-10|2020-10-11|2020-09-10|
# |2020-01-10|2020-10-11|2020-10-11|
# |2020-01-10|2020-10-09|2020-01-10|
# |2020-01-10|2020-10-09|2020-02-10|
# |2020-01-10|2020-10-09|2020-03-10|
# |2020-01-10|2020-10-09|2020-04-10|
# |2020-01-10|2020-10-09|2020-05-10|
# |2020-01-10|2020-10-09|2020-06-10|
# |2020-01-10|2020-10-09|2020-07-10|
# |2020-01-10|2020-10-09|2020-08-10|
# |2020-01-10|2020-10-09|2020-09-10|
# |2020-01-10|2020-10-09|2020-10-09|
# |2020-01-01|2020-02-01|2020-01-01|
# |2020-01-01|2020-02-01|2020-02-01|
# +----------+----------+----------+来自sequence的数组如下所示
+----------+----------+------------------------------------------------------------------------------------------------------------------------+
|start_dt |end_dt |mth_arr |
+----------+----------+------------------------------------------------------------------------------------------------------------------------+
|2020-01-01|2020-10-01|[2020-01-01, 2020-02-01, 2020-03-01, 2020-04-01, 2020-05-01, 2020-06-01, 2020-07-01, 2020-08-01, 2020-09-01, 2020-10-01]|
|2020-01-10|2020-10-11|[2020-01-10, 2020-02-10, 2020-03-10, 2020-04-10, 2020-05-10, 2020-06-10, 2020-07-10, 2020-08-10, 2020-09-10, 2020-10-10]|
|2020-01-10|2020-10-09|[2020-01-10, 2020-02-10, 2020-03-10, 2020-04-10, 2020-05-10, 2020-06-10, 2020-07-10, 2020-08-10, 2020-09-10] |
|2020-01-01|2020-02-01|[2020-01-01, 2020-02-01] |
+----------+----------+------------------------------------------------------------------------------------------------------------------------+以及调整后的阵列
+----------+----------+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|start_dt |end_dt |mth_arr |mth_arr_new |
+----------+----------+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|2020-01-01|2020-10-01|[2020-01-01, 2020-02-01, 2020-03-01, 2020-04-01, 2020-05-01, 2020-06-01, 2020-07-01, 2020-08-01, 2020-09-01, 2020-10-01]|[2020-01-01, 2020-02-01, 2020-03-01, 2020-04-01, 2020-05-01, 2020-06-01, 2020-07-01, 2020-08-01, 2020-09-01, 2020-10-01]|
|2020-01-10|2020-10-11|[2020-01-10, 2020-02-10, 2020-03-10, 2020-04-10, 2020-05-10, 2020-06-10, 2020-07-10, 2020-08-10, 2020-09-10, 2020-10-10]|[2020-01-10, 2020-02-10, 2020-03-10, 2020-04-10, 2020-05-10, 2020-06-10, 2020-07-10, 2020-08-10, 2020-09-10, 2020-10-11]|
|2020-01-10|2020-10-09|[2020-01-10, 2020-02-10, 2020-03-10, 2020-04-10, 2020-05-10, 2020-06-10, 2020-07-10, 2020-08-10, 2020-09-10] |[2020-01-10, 2020-02-10, 2020-03-10, 2020-04-10, 2020-05-10, 2020-06-10, 2020-07-10, 2020-08-10, 2020-09-10, 2020-10-09]|
|2020-01-01|2020-02-01|[2020-01-01, 2020-02-01] |[2020-01-01, 2020-02-01] |
+----------+----------+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+发布于 2022-10-29 16:22:47
选项1:基本python和df.unionByName
我的第一个想法是通过基本的Python操作来解决这个问题。从高层次的角度来看,这看起来如下所示:
import pyspark.sql.functions as F
# Create a list of datetime objects
dates = [..]
# Transform to Spark DataFrame
newdf = spark.CreateDataFrame(dates, schema=['date'])
newdf = newdf.withColumn('date', F.to_date('date', format='yyyy-MM-dd'))
# Combine the dataframes
df = df.unionByName(newdf, allowMissingColumns=True)
# Perform operations needed备选方案2:爆炸
如果您想要( i)使这更动态,例如,当有多个日期范围,您需要归罪和( ii)停留在星火,您可以尝试以下。
# Demo data
data = [{'date': '2021-01-01'},
{'date': '2021-02-01'},
{'date': '2022-02-01'}
]
df = spark.createDataFrame(data)
df = df.withColumn('date', F.to_date(F.col('date'), format='yyyy-MM-dd'))
# Determine the next date and the number of months
df = df.withColumn('date_next', F.lead('date').over(Window().orderBy('date')))
df = df.withColumn('numberofmonths', F.months_between('date_next', 'date').cast(T.IntegerType()))
df = df.fillna(0, subset=['numberofmonths'])
@udf(returnType=T.ArrayType(T.IntegerType()))
def makemonths(numberofmonths):
"""UDF to achieve the list comprehension below"""
if numberofmonths > 1:
months = [xx for xx in range(numberofmonths)]
return months
else:
return []
df = df.withColumn('months', makemonths(F.col('numberofmonths')))
# Create a new dataframe with an explode.
# F.explode_outer is chosen over F.explode to retain dates where months = []
newdf = df.select('date', F.explode_outer('months').alias('months'))
newdf = newdf.fillna(0, subset=['months'])
newdf = newdf.withColumn('newdate', F.add_months(F.col('date'), F.col('months')))这将导致以下两个数据格式
>>> df.show()
+----------+----------+--------------+--------------------+
| date| date_next|numberofmonths| months|
+----------+----------+--------------+--------------------+
|2021-01-01|2021-02-01| 1| []|
|2021-02-01|2022-02-01| 12|[0, 1, 2, 3, 4, 5...|
|2022-02-01| null| 0| []|
+----------+----------+--------------+--------------------+
>>> newdf.show()
+----------+------+----------+
| date|months| newdate|
+----------+------+----------+
|2021-01-01| 0|2021-01-01|
|2021-02-01| 0|2021-02-01|
|2021-02-01| 1|2021-03-01|
|2021-02-01| 2|2021-04-01|
|2021-02-01| 3|2021-05-01|
|2021-02-01| 4|2021-06-01|
|2021-02-01| 5|2021-07-01|
|2021-02-01| 6|2021-08-01|
|2021-02-01| 7|2021-09-01|
|2021-02-01| 8|2021-10-01|
|2021-02-01| 9|2021-11-01|
|2021-02-01| 10|2021-12-01|
|2021-02-01| 11|2022-01-01|
|2022-02-01| 0|2022-02-01|
+----------+------+----------+https://stackoverflow.com/questions/74242856
复制相似问题