我有一个这样的熊猫数据框架:
Account Id Gross Sum Invoice Type Name Net Sum Company Security Supplier Date Completed YearMonth Category
710830 282.81 Invoice 282.81 asd5a Abc 1/1/2018 2018-1 Postal
445800 4868.71 Invoice 3926.4 adc6ac Def 1/1/2018 2018-1 R&D
710350 282.81 Invoice 282.81 fgn6 Ghi 2/9/2018 2018-2 Other
710510 282.81 Invoice 282.81 dg jkl 2/9/2018 2018-2 Electricity
710630 841.59 Invoice 707.07 dfvbfbf mno 3/2/2018 2018-3 Repairs
710610 841.59 Invoice 707.07 rrcv pqr 3/2/2018 2018-3 Leasing
710810 12.14 Invoice 10.12 btbfd stu 1/1/2019 2019-1 Telephone
704300 81517.6 Invoice 65740 dfbtt vwx 1/1/2019 2019-1 Statutory
710510 2105.64 Invoice 1776.53 dfdftb5 dfb 2/9/2019 2019-2 Electricity
710510 2105.64 Invoice 1776.53 ebdfb5b bcd 2/9/2019 2019-2 Electricity
710920 66.96 Invoice 54 dfrrt65 efg 3/2/2019 2019-3 Data
700330 239.47 Invoice 239.47 aae3a11 hij 3/2/2019 2019-3 Coffee我想要的是在数据帧的底部添加行,该数据框架计算了连续3年的同月平均值。
例如:对于year month 2020-1,计算应该是for 2020-1 = sum(Net Sum Company) In 2019-1 + sum(Net Sum Company) in 2018-1 + sum(Net Sum Company) In 2017-1 divided by the number of months considered i.3,所以只需要考虑最后三年。这样,我将得到平均值,并在底部添加与新行相同的内容,其中只有年、月和公司净收入栏的平均值。最终目标是获得这样的数据框架:
Account Id Gross Sum Invoice Type Name Net Sum Company Security Supplier Date Completed YearMonth Category
710830 282.81 Invoice 282.81 asd5a Abc 1/1/2018 2018-1 Postal
445800 4868.71 Invoice 3926.4 adc6ac Def 1/1/2018 2018-1 R&D
710350 282.81 Invoice 282.81 fgn6 Ghi 2/9/2018 2018-2 Other
710510 282.81 Invoice 282.81 dg jkl 2/9/2018 2018-2 Electricity
710630 841.59 Invoice 707.07 dfvbfbf mno 3/2/2018 2018-3 Repairs
710610 841.59 Invoice 707.07 rrcv pqr 3/2/2018 2018-3 Leasing
710810 12.14 Invoice 10.12 btbfd stu 1/1/2019 2019-1 Telephone
704300 81517.6 Invoice 65740 dfbtt vwx 1/1/2019 2019-1 Statutory
710510 2105.64 Invoice 1776.53 dfdftb5 dfb 2/9/2019 2019-2 Electricity
710510 2105.64 Invoice 1776.53 ebdfb5b bcd 2/9/2019 2019-2 Electricity
710920 66.96 Invoice 54 dfrrt65 efg 3/2/2019 2019-3 Data
700330 239.47 Invoice 239.47 aae3a11 hij 3/2/2019 2019-3 Coffee
- - - 34979.66 - - - 2020-1 -
- - - 2059.34 - - - 2020-2 -
- - - 853.805 - - - 2020-3 -我是新来的熊猫,所以任何指导都是非常感谢的。这必须严格使用熊猫。
发布于 2020-03-05 10:57:56
IIUC,你想:
Net Sum Company列YearMonth列H 210f 211代码可以是:
# extract Year and Month Series from the dataframe
year = df['YearMonth'].str.slice(stop=4).astype(int)
month = df['YearMonth'].str.slice(start=5)
# compute the new year per month as max(year) + 1
newyear_month = year.groupby(month).max() + 1
# build a Series aligned with the dataframe from that new year
newyear = pd.DataFrame(month).merge(
pd.DataFrame(newyear_month),
left_on='YearMonth', right_index=True, suffixes=('_x', '')
)['YearMonth'].sort_index()
# compute the sum of relevant years per month
tmp = df.loc[(newyear-3 <= year) & (year <= newyear-1), 'Net Sum Company'
].groupby(month).sum()
# divide by the number of distinct month per sum
tmp /= df.groupby(month)['YearMonth'].nunique()
# compute a YearMonth column for that new dataframe
tmp = pd.concat([newyear_month.astype(str), tmp], axis=1)
tmp['YearMonth'] = tmp['YearMonth'] + '-' + tmp.index # tmp is indexed by month
# force the type of Account Id to object to allow it to contain null values
df['Account Id'] = df['Account Id'].astype(object)
# concat the new rows to the dataframe and reset the index
new_df = df.append(tmp, sort=False).reset_index(drop=True)在您的示例中,new_df给出:
Account Id Gross Sum Invoice Type Name Net Sum Company Security Supplier Date Completed YearMonth Category
0 710830 282.81 Invoice 282.810 asd5a Abc 1/1/2018 2018-1 Postal
1 445800 4868.71 Invoice 3926.400 adc6ac Def 1/1/2018 2018-1 R&D
2 710350 282.81 Invoice 282.810 fgn6 Ghi 2/9/2018 2018-2 Other
3 710510 282.81 Invoice 282.810 dg jkl 2/9/2018 2018-2 Electricity
4 710630 841.59 Invoice 707.070 dfvbfbf mno 3/2/2018 2018-3 Repairs
5 710610 841.59 Invoice 707.070 rrcv pqr 3/2/2018 2018-3 Leasing
6 710810 12.14 Invoice 10.120 btbfd stu 1/1/2019 2019-1 Telephone
7 704300 81517.60 Invoice 65740.000 dfbtt vwx 1/1/2019 2019-1 Statutory
8 710510 2105.64 Invoice 1776.530 dfdftb5 dfb 2/9/2019 2019-2 Electricity
9 710510 2105.64 Invoice 1776.530 ebdfb5b bcd 2/9/2019 2019-2 Electricity
10 710920 66.96 Invoice 54.000 dfrrt65 efg 3/2/2019 2019-3 Data
11 700330 239.47 Invoice 239.470 aae3a11 hij 3/2/2019 2019-3 Coffee
12 NaN NaN NaN 34979.665 NaN NaN NaN 2020-1 NaN
13 NaN NaN NaN 2059.340 NaN NaN NaN 2020-2 NaN
14 NaN NaN NaN 853.805 NaN NaN NaN 2020-3 NaN备注:
每月查找新年允许在滚动年份(例如2017年7月至2019年6月)使用代码,可以用new_df = new_df.fillna('')替换为空字符串(或其他什么)的
。发布于 2020-03-05 09:46:25
对于一个简单的3y滚动平均值,可以这样做:
df1['Date Completed'] = pd.to_datetime(df1['Date Completed'])
df1['roll_3y_avg'] = df1.rolling(window='1096D', on='Date Completed', closed='right')['Net Sum Company'].mean()https://stackoverflow.com/questions/60541755
复制相似问题