The pandas df code is:
Data = data[data['ObservationDate'] == max(data['ObservationDate'])].reset_index()
Data_world = Data.groupby(["ObservationDate"])[["Confirmed", "Active_case", "Recovered", "Deaths"]].sum().reset_index()
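(For comparison, a minimal sketch of the same aggregation written with pandas named aggregation, assuming pandas 0.25+ and the Data frame built above:)

# Equivalent to the groupby/sum above, with explicit output column names.
Data_world = Data.groupby("ObservationDate", as_index=False).agg(
    Confirmed=("Confirmed", "sum"),
    Active_case=("Active_case", "sum"),
    Recovered=("Recovered", "sum"),
    Deaths=("Deaths", "sum"),
)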
The DataFrame structure looks like this:
SNo ObservationDate Province/State Country/Region Last Update Confirmed Deaths Recovered Active_case
0 1 01/22/2020 Anhui China 1/22/2020 17:00 1 0 0 1
1 2 01/22/2020 Beijing China 1/22/2020 17:00 14 0 0 14
2 3 01/22/2020 Chongqing China 1/22/2020 17:00 6 0 0 6
3 4 01/22/2020 Fujian China 1/22/2020 17:00 1 0 0 1
4 5 01/22/2020 Gansu China 1/22/2020 17:00 0 0 0 0

I want output like this:
ObservationDate Confirmed Active_case Recovered Deaths
0 03/22/2020 335957 223441 97882 14634

How do I filter for the max date? My attempt so far:
max_date = df.select(max("ObservationDate")).first()
group_data = df.groupBy("ObservationDate")
group_data.agg({'Confirmed':'sum', 'Deaths':'sum', 'Recovered':'sum', 'Active_case':'sum'}).show()
I think this is what you want. You can collect your max date first, then use it in a filter before the groupBy and aggregate.
from pyspark.sql import functions as F
max_date=df.select(F.max("ObservationDate")).collect()[0][0]
df.filter(F.col("ObservationDate")==max_date)\
.groupBy("ObservationDate")\
.agg({'Confirmed':'sum', 'Deaths':'sum', 'Recovered':'sum', 'Active_case':'sum'})\
.show()

Source: https://stackoverflow.com/questions/60833722
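One follow-up note: the dict-style agg above produces column headers like sum(Confirmed). A minimal sketch, assuming the same df and column names, that aliases each sum so the headers match the desired pandas-style output:

from pyspark.sql import functions as F

# Same filter as above, but each aggregate is aliased so the output
# columns read Confirmed / Active_case / Recovered / Deaths.
max_date = df.select(F.max("ObservationDate")).collect()[0][0]
(df.filter(F.col("ObservationDate") == max_date)
   .groupBy("ObservationDate")
   .agg(F.sum("Confirmed").alias("Confirmed"),
        F.sum("Active_case").alias("Active_case"),
        F.sum("Recovered").alias("Recovered"),
        F.sum("Deaths").alias("Deaths"))
   .show())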