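I am merging overlapping intervals in a pandas DataFrame and am looking for an efficient way to do it in pandas, other than the usual algorithm that walks through the rows one at a time. How can I do this in pandas?
I have already tried the usual approach of looping over every row and checking whether the current row.start < last_end. That works for me.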
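For reference, the row-by-row approach described above might look like the following minimal sketch. The function name merge_intervals_loop is hypothetical, and the sketch assumes that intervals which merely touch (START equal to the previous FINISH) should also be merged, so it uses <= rather than a strict <.

```python
import pandas as pd

def merge_intervals_loop(df):
    """Merge overlapping [START, FINISH] intervals by iterating over sorted rows."""
    merged = []  # list of [start, finish] pairs, built up in START order
    rows = df.sort_values("START")[["START", "FINISH"]].itertuples(index=False)
    for start, finish in rows:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the last merged interval: extend it.
            merged[-1][1] = max(merged[-1][1], finish)
        else:
            # There is a gap before this interval: start a new merged interval.
            merged.append([start, finish])
    return pd.DataFrame(merged, columns=["START", "FINISH"])

df = pd.DataFrame({
    "START": [0.0, 10.0, 0.0, 10.0, 9850.123],
    "FINISH": [10.0, 8700.182997, 10.0, 9720.687227, 9990.0],
})
loop_result = merge_intervals_loop(df)
print(loop_result)
```

This is the baseline that a vectorized pandas solution would replace; it is linear after sorting but runs in a Python-level loop.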
Suppose I have the following input data:
df:
START FINISH
0 0.000000 10.000000
1 10.000000 8700.182997
2 0.000000 10.000000
3 10.000000 9720.687227
4 9850.123 9990.000000
I expect the following output:
df:
START FINISH
0 0.000000 9720.687227
2 9850.123 9990.000000
Thanks in advance!
Posted on 2019-09-11 07:01:06
You can do this with pandas alone:
import pandas as pd
import io
## load data
raw ="""START,FINISH
0.000000 ,10.000000
10.000000 ,4500.182997
5000.00 ,7000.000000
6000 ,8500.687227
9850.123,9990.000000
"""
buf_bytes = io.StringIO(raw)
df=pd.read_csv(buf_bytes)
## solution
df.sort_values("START", inplace=True)
## Compare each row's START with the FINISH of the previous row
## (shift() moves FINISH down by one row, so row i sees the FINISH of row i-1).
## The comparison is True where a row starts after the previous row ends,
## i.e. where the interval breaks and cannot be merged. cumsum() treats
## True as 1 and False as 0, so the group number goes up by one at each break.
df["group"]=(df["START"]>df["FINISH"].shift()).cumsum()
## take the minimum START and the maximum FINISH within each group
result=df.groupby("group").agg({"START":"min", "FINISH": "max"})
print(result)
Output:
START FINISH
group
0 0.000 4500.182997
1 5000.000 8500.687227
2 9850.123 9990.000000
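To see why the grouping line works, it can help to look at the intermediate values. The sketch below rebuilds the same five rows and stores each step in its own variable; the names prev_finish, is_break and group are only illustrative and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "START": [0.0, 10.0, 5000.0, 6000.0, 9850.123],
    "FINISH": [10.0, 4500.182997, 7000.0, 8500.687227, 9990.0],
}).sort_values("START")

# FINISH of the row above; the first row has no predecessor, so it gets NaN.
prev_finish = df["FINISH"].shift()
# True where a row starts after the previous row has ended (a break).
is_break = df["START"] > prev_finish
# Running count of breaks, which serves as the group label.
group = is_break.cumsum()

print(pd.DataFrame({"prev_finish": prev_finish, "is_break": is_break, "group": group}))
```

Any comparison against NaN is False, so the first row always lands in group 0.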
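Posted on 2020-12-14 02:37:44
The answer above is inspiring, but there is still some room for improvement.
(1) The comment about shift() should make clear that each row is compared with the record above it (the previous row), not the one below. (2) It does not handle the case where a row lies entirely within the bounds of an earlier record. Simply adding cummax() solves that.
Here is the modified code: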
import pandas as pd
import io
## load data
raw ="""START,FINISH
0.000000 ,10.000000
2.000000 ,3.000000
10.000000 ,4500.182997
5000.00 ,7000.000000
6000 ,8500.687227
9850.123,9990.000000
"""
buf_bytes = io.StringIO(raw)
df=pd.read_csv(buf_bytes)
## solution
df.sort_values("START", inplace=True)
## Compare each row's START with the largest FINISH among all previous rows
## (shift() gives each row the FINISH of the row above it, and cummax() then
## keeps the running maximum). The comparison is True where the interval
## breaks (i.e. cannot be merged), so cumsum() increments the group number
## at each break (cumsum treats True as 1 and False as 0).
df["group"]=(df["START"]>df["FINISH"].shift().cummax()).cumsum()
print(df)
## take the minimum START and the maximum FINISH within each group
result=df.groupby("group").agg({"START":"min", "FINISH": "max"})
print(result)
Output:
START FINISH
group
0 0.000 4500.182997
1 5000.000 8500.687227
2 9850.123 9990.000000
Result of the unmodified solution:
START FINISH
group
0 0.000 10.000000
1 10.000 4500.182997
2 5000.000 8500.687227
3 9850.123 9990.000000
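As a minimal sketch, the cummax() variant can be wrapped in a function and applied to the data from the question. The function name merge_overlaps is hypothetical, and the sketch assumes that touching intervals (START equal to an earlier FINISH) belong to the same group, which follows from the strict > comparison.

```python
import pandas as pd

def merge_overlaps(df):
    """Merge overlapping [START, FINISH] rows using shift/cummax/cumsum."""
    ordered = df.sort_values("START")
    # Largest FINISH seen in all earlier rows (NaN for the first row).
    prev_max_finish = ordered["FINISH"].shift().cummax()
    # A new group starts wherever START lies beyond everything seen so far.
    group = (ordered["START"] > prev_max_finish).cumsum()
    return (
        ordered.groupby(group)
        .agg({"START": "min", "FINISH": "max"})
        .reset_index(drop=True)
    )

question_df = pd.DataFrame({
    "START": [0.0, 10.0, 0.0, 10.0, 9850.123],
    "FINISH": [10.0, 8700.182997, 10.0, 9720.687227, 9990.0],
})
merged = merge_overlaps(question_df)
print(merged)
```

On the question's input this yields two intervals, 0 to 9720.687227 and 9850.123 to 9990, which matches the output the question asks for.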
https://stackoverflow.com/questions/57882621