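The data is sorted by date.
The value col1 == 1 is unique,
and after the col1 == 1 row col1 increases in steps of 1 (e.g. 1,2,3,4,5,6,7...). Only -1 is duplicated.
I have a DataFrame that looks like this (TEST_df below):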
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("col1", IntegerType(), True),
                          StructField("col2", IntegerType(), True)])
TEST_data = [('2020-08-01',-1,-1),('2020-08-02',-1,-1),('2020-08-03',-1,3),('2020-08-04',-1,2),('2020-08-05',1,4),
             ('2020-08-06',2,1),('2020-08-07',3,2),('2020-08-08',4,3),('2020-08-09',5,-1)]
rdd3 = sc.parallelize(TEST_data)  # not used below
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df.show()
+----------+----+----+
|      date|col1|col2|
+----------+----+----+
|2020-08-01|  -1|  -1|
|2020-08-02|  -1|  -1|
|2020-08-03|  -1|   3|
|2020-08-04|  -1|   2|
|2020-08-05|   1|   4|
|2020-08-06|   2|   1|
|2020-08-07|   3|   2|
|2020-08-08|   4|   3|
|2020-08-09|   5|  -1|
+----------+----+----+
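The condition is: at the row where col1 == 1, start from its col2 value (here col2 == 4) and count upward going backward in date (e.g. 4,5,6,7,8,...); every row after it gets 0 (e.g. 4,0,0,0,0...).
So the resulting DataFrame should look like this: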
+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|  -1|  -1|   8|
|2020-08-02|  -1|  -1|   7|
|2020-08-03|  -1|   3|   6|
|2020-08-04|  -1|   2|   5|
|2020-08-05|   1|   4|   4|
|2020-08-06|   2|   1|   0|
|2020-08-07|   3|   2|   0|
|2020-08-08|   4|   3|   0|
|2020-08-09|   5|  -1|   0|
+----------+----+----+----+
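To pin down the rule, here is a minimal plain-Python sketch of the expected result (not Spark code). The function name basic_want is made up for illustration, and it assumes the rows are already in ascending date order with exactly one col1 == 1 row.

```python
def basic_want(col1, col2):
    """Expected `want` values for rows given in ascending date order."""
    anchor = col1.index(1)   # position of the unique col1 == 1 row
    base = col2[anchor]      # value the count starts from
    # Rows at or before the anchor get base + distance to the anchor;
    # rows after the anchor get 0.
    return [base + (anchor - i) if i <= anchor else 0
            for i in range(len(col1))]


col1 = [-1, -1, -1, -1, 1, 2, 3, 4, 5]
col2 = [-1, -1, 3, 2, 4, 1, 2, 3, -1]
print(basic_want(col1, col2))  # [8, 7, 6, 5, 4, 0, 0, 0, 0]
```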
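Enhancement: I want to add a further condition for the case where col2 == -1 at the col1 == 1 row (at 2020-08-05) and col2 == -1 continues on consecutive rows. In that case I want to count the consecutive -1 values and add that count to the col2 value at the row where the run breaks. Here is an example to make it clear.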
+----------+----+----+----+
|      date|col1|col2|want|
+----------+----+----+----+
|2020-08-01|  -1|  -1|  11|
|2020-08-02|  -1|  -1|  10|
|2020-08-03|  -1|   3|   9|
|2020-08-04|  -1|   2|   8|
|2020-08-05|   1|  -1|  7*|
|2020-08-06|   2|  -1|   0|
|2020-08-07|   3|  -1|   0|
|2020-08-08|   4|  4*|   0|
|2020-08-09|   5|  -1|   0|
+----------+----+----+----+
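So we see 3 consecutive -1s (starting from 2020-08-05; we only care about the first run of consecutive -1s), and the run is followed by a 4 (at 2020-08-08, marked with *), so we get 4 + 3 = 7 at the col1 == 1 row. Is this possible?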
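The enhanced rule can be sketched in plain Python as well (again not Spark code). The name enhanced_want is hypothetical, and raising an error when the run of -1 values never ends is my own assumption, since the question does not say what should happen then.

```python
def enhanced_want(col1, col2):
    """Expected `want` values under the enhanced rule (ascending date order)."""
    anchor = col1.index(1)   # position of the unique col1 == 1 row
    # Walk forward from the anchor past the leading run of -1 values.
    j = anchor
    while j < len(col2) and col2[j] == -1:
        j += 1
    if j == len(col2):
        # Assumption: the question leaves this case undefined.
        raise ValueError("col2 is -1 from the col1 == 1 row to the end")
    run = j - anchor         # number of consecutive -1 values
    base = col2[j] + run     # e.g. 4 + 3 = 7
    return [base + (anchor - i) if i <= anchor else 0
            for i in range(len(col1))]


col1 = [-1, -1, -1, -1, 1, 2, 3, 4, 5]
col2 = [-1, -1, 3, 2, -1, -1, -1, 4, -1]
print(enhanced_want(col1, col2))  # [11, 10, 9, 8, 7, 0, 0, 0, 0]
```

When col2 at the col1 == 1 row is not -1, the run length is 0 and the result is the same as under the original rule.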
**My first attempt**
import sys
from pyspark.sql import Window
from pyspark.sql.functions import col, desc, rank, sum, when
TEST_df = TEST_df.withColumn('cumsum', sum(when( col('col1') < 1, col('col1') ) \
                 .otherwise( when( col('col1') == 1, 1).otherwise(0))).over(Window.partitionBy('col1').orderBy().rowsBetween(-sys.maxsize, 0)))
TEST_df.show()
+----------+----+----+------+
| date|col1|col2|cumsum|
+----------+----+----+------+
|2020-08-01| -1| -1| -1|
|2020-08-02| -1| -1| -2|
|2020-08-03| -1| 3| -3|
|2020-08-04| -1| 2| -4|
|2020-08-05| 1| 4| 1|
|2020-08-07| 3| 2| 0|
|2020-08-09| 5| -1| 0|
|2020-08-08| 4| 3| 0|
|2020-08-06| 2| 1| 0|
+----------+----+----+------+
w1 = Window.orderBy(desc('date'))
w2 = Window.partitionBy('case').orderBy(desc('cumsum'))
TEST_df.withColumn('case', sum(when( (col('cumsum') == 1) & (col('col2') != -1) , col('col2')) \
       .otherwise(0)).over(w1)) \
  .withColumn('rank', when(col('case') != 0, rank().over(w2)-1).otherwise(0)) \
  .withColumn('want', col('case') + col('rank')) \
  .orderBy('date') \
  .show(truncate=False)
+----------+----+----+------+----+----+----+
|date |col1|col2|cumsum|case|rank|want|
+----------+----+----+------+----+----+----+
|2020-08-01|-1 |-1 |-1 |4 |1 |5 |
|2020-08-02|-1 |-1 |-2 |4 |2 |6 |
|2020-08-03|-1 |3 |-3 |4 |3 |7 |
|2020-08-04|-1 |2 |-4 |4 |4 |8 |
|2020-08-05|1 |4 |1 |4 |0 |4 |
|2020-08-06|2 |1 |0 |0 |0 |0 |
|2020-08-07|3 |2 |0 |0 |0 |0 |
|2020-08-08|4 |3 |0 |0 |0 |0 |
|2020-08-09|5 |-1 |0 |0 |0 |0 |
+----------+----+----+------+----+----+----+
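As you can see, the rank is 1,2,3,4; if I could make it 4,3,2,1 it would look like the result I want. How can I reverse it? I tried ordering both ascending and descending. Of course, this is all before the enhancement.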
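To make the reversal concrete, here is a small plain-Python sketch (not Spark code) of rank() - 1 over the five rows of the case == 4 partition under two different orderings. The helper rank_minus_one is made up for illustration. Because the sort keys are unique, the rank is just the position in the sorted order. Ordering by date descending instead of cumsum descending yields 4,3,2,1 for the first four dates, which presumably corresponds to using desc('date') in w2, although I have not run that in Spark here.

```python
# (date, cumsum) for the rows that fall in the case == 4 partition
rows = [
    ("2020-08-01", -1), ("2020-08-02", -2), ("2020-08-03", -3),
    ("2020-08-04", -4), ("2020-08-05", 1),
]


def rank_minus_one(rows, key):
    """Zero-based position of each date when rows are sorted descending by key."""
    ordered = sorted(rows, key=key, reverse=True)
    return {date: pos for pos, (date, _) in enumerate(ordered)}


by_cumsum = rank_minus_one(rows, key=lambda r: r[1])  # ordering used in w2
by_date = rank_minus_one(rows, key=lambda r: r[0])    # reversed sequence

case = 4  # col2 at the col1 == 1 row
print([case + by_cumsum[d] for d, _ in rows])  # [5, 6, 7, 8, 4]
print([case + by_date[d] for d, _ in rows])    # [8, 7, 6, 5, 4]
```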
Posted on 2020-08-07 02:59:26
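IIUC, you can try the following:

1. Use groupby and collect_list to gather all related rows into one array (vals in the code below), and sort the list by date in descending order. (Note: change groupby(lit(1)) to whatever column(s) can be used to divide your data into independent subsets.)
2. Find the array index idx of the row that has col1 == 1.
3. If col2 == -1 at idx, find the offset from idx toward the beginning of the list to the first row that has col2 != -1. (Note: the offset may be NULL if every col2 before idx is -1, so you have to decide what you want in that case; for example, use coalesce(IF(...),0).)
4. With idx and offset known, the want column can be calculated as:

   IF(i<idx, 0, vals[idx-offset].col2 + offset + i - idx)

Note: if the production data has too many columns, the same logic can be applied with a Window function.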
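The steps above can be traced with a plain-Python sketch (not Spark code) on a list that is already sorted by date descending. The function name want_from_desc is hypothetical. A missing offset falls back to 0, mirroring coalesce(..., 0); whether that fallback is what you want is a decision for your data.

```python
def want_from_desc(vals):
    """vals: list of (date, col1, col2) tuples sorted by date descending."""
    # Step 2: index of the row with col1 == 1
    idx = next(i for i, v in enumerate(vals) if v[1] == 1)
    # Step 3: distance toward the start of the list to the first col2 != -1
    offset = 0
    if vals[idx][2] == -1:
        found = next((i for i in range(1, idx + 1)
                      if vals[idx - i][2] != -1), None)
        offset = 0 if found is None else found  # mirrors coalesce(..., 0)
    # Step 4: IF(i<idx, 0, vals[idx-offset].col2 + offset + i - idx)
    base = vals[idx - offset][2] + offset
    return [(v[0], 0 if i < idx else base + i - idx)
            for i, v in enumerate(vals)]


vals = [
    ("2020-08-09", 5, -1), ("2020-08-08", 4, 4), ("2020-08-07", 3, -1),
    ("2020-08-06", 2, -1), ("2020-08-05", 1, -1), ("2020-08-04", -1, 2),
    ("2020-08-03", -1, 3), ("2020-08-02", -1, -1), ("2020-08-01", -1, -1),
]
for date, want in sorted(want_from_desc(vals)):
    print(date, want)
```

Here idx is 4 and offset is 3, so the col1 == 1 row gets 4 + 3 = 7 and each earlier date adds 1. If every row before idx has col2 == -1, the fallback offset of 0 makes the count start from -1.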
The code is below:
from pyspark.sql.functions import sort_array, collect_list, struct, expr, lit

TEST_df = spark.createDataFrame([
    ('2020-08-01', -1, -1), ('2020-08-02', -1, -1), ('2020-08-03', -1, 3),
    ('2020-08-04', -1, 2), ('2020-08-05', 1, -1), ('2020-08-06', 2, -1),
    ('2020-08-07', 3, -1), ('2020-08-08', 4, 4), ('2020-08-09', 5, -1)
], ['date', 'col1', 'col2'])

# list of columns used in the calculation
cols = ["date", "col1", "col2"]

df_new = TEST_df \
    .groupby(lit(1)) \
    .agg(sort_array(collect_list(struct(*cols)),False).alias('vals')) \
    .withColumn('idx', expr("filter(sequence(0,size(vals)-1), i -> vals[i].col1=1)[0]")) \
    .withColumn('offset', expr("""
        coalesce(IF(vals[idx].col2=-1, filter(sequence(1,idx), i -> vals[idx-i].col2 != -1)[0],0),0)
    """)).selectExpr("""
        inline(
          transform(vals, (x,i) -> named_struct(
              'dta', x,
              'want', IF(i<idx, 0, vals[idx-offset].col2 + offset + i - idx)
            )
          )
        )""").select('dta.*', 'want')
Output:
df_new.orderBy('date').show()
+----------+----+----+----+
| date|col1|col2|want|
+----------+----+----+----+
|2020-08-01| -1| -1| 11|
|2020-08-02| -1| -1| 10|
|2020-08-03| -1| 3| 9|
|2020-08-04| -1| 2| 8|
|2020-08-05| 1| -1| 7|
|2020-08-06| 2| -1| 0|
|2020-08-07| 3| -1| 0|
|2020-08-08| 4| 4| 0|
|2020-08-09| 5| -1| 0|
+----------+----+----+----+
Edit: per the comments, added an alternative that uses a Window aggregate function instead of groupby:
from pyspark.sql import Window

# WindowSpec to cover all related Rows in the same partition
w1 = Window.partitionBy().orderBy('date').rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)

cols = ["date", "col1", "col2"]

# below `cur_idx` is the index for the current Row in array `vals`
df_new = TEST_df.withColumn('vals', sort_array(collect_list(struct(*cols)).over(w1),False)) \
    .withColumn('idx', expr("filter(sequence(0,size(vals)-1), i -> vals[i].col1=1)[0]")) \
    .withColumn('offset', expr("IF(vals[idx].col2=-1, filter(sequence(1,idx), i -> vals[idx-i].col2 != -1)[0],0)")) \
    .withColumn("cur_idx", expr("array_position(vals, struct(date,col1,col2))-1")) \
    .selectExpr(*TEST_df.columns, "IF(cur_idx<idx, 0, vals[idx-offset].col2 + offset + cur_idx - idx) as want")
https://stackoverflow.com/questions/63290611