真的为这个滚动计算而挣扎..。任何帮助都非常感谢。
关于数据文件:
sale_year seller item_id mean_estimate sale_price deviation status
0 2019 bob 1 20000 11000 -9000 sold
1 2019 alice 2 35000 39000 4000 sold
2 2018 bob 3 15000 17000 2000 not sold
3 2017 alice 4 60000 120000 60000 sold
4 2017 alice 5 50000 80000 30000 sold
5 2017 alice 6 60000 120000 60000 sold
6 2017 alice 7 40000 120000 80000 sold
7 2017 bob 8 20000 27000 7000 sold
8 2017 alice 9 200000 175000 -25000 sold
9 2016 alice 10 100000 150000 50000 sold
10 2015 bob 11 75000 100000 25000 sold
11 2015 alice 12 100000 150000 50000 sold
12 2015 alice 13 10000 15000 5000 sold
13 2015 alice 14 100000 150000 50000 sold
13 2009 alice 15 150000 150000 0 sold 代码:
data = {
"sale_year": [2019, 2019, 2018, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2015, 2015, 2015, 2015, 2009],
"seller": ["bob", "alice", "bob", "alice", "alice", "alice", "alice", "bob", "alice", "alice", "bob", "alice", "alice", "alice", "alice"],
"item_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
"mean_estimate": [20000, 35000, 15000, 60000, 50000, 60000, 40000, 20000, 200000, 100000, 75000, 100000, 10000, 100000, 150000],
"sale_price": [11000, 39000, 17000, 120000, 80000, 120000, 120000, 27000, 175000, 150000, 100000, 150000, 15000, 150000, 150000],
"deviation": [-9000, 4000, 2000, 60000, 30000, 60000, 80000, 7000, -25000, 50000, 25000, 50000, 5000, 50000, 0],
"status": ["sold", "sold", "not sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold", "sold"]
}
test = pd.DataFrame(data)我正在评估拍卖销售数据,并试图为每个卖家计算额外的变量,这些变量总结了他们在给定销售前5年的表现(不包括同一年的销售)。
如果某件商品在2019年售出,则返回的变量应该是该卖家2014年至2018年(不包括2019年)期间销售的所有销售额的汇总。如果2018年售出的商品是
要生成的变量:
这段时期内period
现在我知道我可以使用熊猫.groupby()和agg函数来生成个人年份的分组统计数据,例如df.groupby([df.sale_year, df.seller]).mean_estimate.agg([np.mean, np.std])
我利用了smci的有益建议,使用了类似的.rolling()方法:然而,alv = df.groupby(['seller']).rolling(5, min_periods=1).agg({'mean_estimate': ['mean', 'std', 'count'], 'deviation': ['mean', 'std', 'count']})的问题是,它包含了在同一年中发生的销售行(应该将它们排除在外),而我想要使用的时间是销售前的5年,而不是最近的5次交易(在我的数据中有很多5000+销售数据)。
示例输出:
ref_year seller avg_est avg_dev sd_est listings sales prop_sold ln_sales
2019 bob 36666 11333.3 33291.6 3 2 0.66 ln(3)
2019 alice 80000 40000 54543.5 9 9 1 ln(10)
2018 bob 47500 16000 38890.9 2 2 1 ln(3)
2018 alice 80000 40000 54542.5 9 9 1 ln(10)
... ... ... ... ... ... ... ... ...
2010 bob NaN NaN NaN NaN NaN NaN ln(1)
2010 alice 100000 50000 NA 1 1 1 ln(2) 理想情况下,我想坚持熊猫/基本蟒蛇,因为我想把上面所有的计算定义为一个函数,保存在一个库中,以便于将来的ETL。
谢谢您的协助。
我未能通过Pandas和agg方法找到解决方案,但我设法按照@pygirl的建议,在for-循环中正确地聚合和执行了计算。
cols = ['index', 'sale_year', 'seller', 'realized_price','lot_status',
'mean_estimated_usd', 'deviation', 'mean_estimated_usd_log', 'deviation_rel_log',
'avgestimate', 'avgdeviation', 'sdestimate',
'sddeviation', 'numlistings', 'numlistings_sold', 'propbuyin',
'numlistings_log_prep', 'avgestimate_log_prep', 'cvestimate',
'cvdeviation']
appended_data = []
df_temp = pd.DataFrame(columns=cols)
for i in test.seller.unique():
row_sale_year = np.unique(test.sale_year.values)
row_seller = test.seller.iloc[0]
for year in row_sale_year:
# subset data to listingss relevant to a given seller and sale_year
alv_subset2 = (test.loc[(test.seller==row_seller) & (year-5 <= test.sale_year) & (test.sale_year< year)])
# filter down to only relevant columns
alv_subset2 = alv_subset2[['seller', 'sale_price', 'status', 'mean_estimate', 'deviation']]
alv_subset2['ref_year'] = year
alv_subset2 = alv_subset2.reset_index()
alv_subset2['avg_sale_price'] = np.mean(alv_subset2.sale_price)
alv_subset2['avgestimate'] = np.mean(alv_subset2.mean_estimate)
alv_subset2['numlistings'] = len(alv_subset2.status)
alv_subset2['numlistings_sold'] = len(alv_subset2[alv_subset2['status']=='sold'])
alv_subset2['propbuyin'] = (alv_subset2.numlistings-alv_subset2.numlistings_sold) / alv_subset2.numlistings.apply(lambda x: float(x))
alv_subset2['numlistings_log_prep'] = np.log(1 + alv_subset2.numlistings)
alv_subset2['avgestimate_log_prep'] = np.log(1 + alv_subset2.avgestimate)
for i in to_float:
alv_subset2[i] = alv_subset2[i].astype(float, errors = 'raise')
df_temp = df_temp.append(alv_subset2)
df_temp.drop(['index', 'status','realized_price', 'status', 'mean_estimated_usd', 'deviation', 'mean_estimated_usd_log', 'deviation_rel_log'], axis=1)
appended_data.append(alv_subset2)
# see pd.concat documentation for more info
df_temp = pd.concat(appended_data)发布于 2021-01-25 14:22:09
我无法通过Pandas和agg方法找到解决方案,但我设法按照@pygirl的建议在for-循环中正确地聚合和执行了计算。
欢迎更多的节奏曲编辑的代码。
谢谢。
cols = ['index', 'sale_year', 'seller', 'realized_price','lot_status',
'mean_estimated_usd', 'deviation', 'mean_estimated_usd_log', 'deviation_rel_log',
'avgestimate', 'avgdeviation', 'sdestimate',
'sddeviation', 'numlistings', 'numlistings_sold', 'propbuyin',
'numlistings_log_prep', 'avgestimate_log_prep', 'cvestimate',
'cvdeviation']
appended_data = []
df_temp = pd.DataFrame(columns=cols)
for i in test.seller.unique():
row_sale_year = np.unique(test.sale_year.values)
row_seller = test.seller.iloc[0]
for year in row_sale_year:
# subset data to listingss relevant to a given seller and sale_year
alv_subset2 = (test.loc[(test.seller==row_seller) & (year-5 <= test.sale_year) & (test.sale_year< year)])
# filter down to only relevant columns
alv_subset2 = alv_subset2[['seller', 'sale_price', 'status', 'mean_estimate', 'deviation']]
alv_subset2['ref_year'] = year
alv_subset2 = alv_subset2.reset_index()
alv_subset2['avg_sale_price'] = np.mean(alv_subset2.sale_price)
alv_subset2['avgestimate'] = np.mean(alv_subset2.mean_estimate)
alv_subset2['numlistings'] = len(alv_subset2.status)
alv_subset2['numlistings_sold'] = len(alv_subset2[alv_subset2['status']=='sold'])
alv_subset2['propbuyin'] = (alv_subset2.numlistings-alv_subset2.numlistings_sold) / alv_subset2.numlistings.apply(lambda x: float(x))
alv_subset2['numlistings_log_prep'] = np.log(1 + alv_subset2.numlistings)
alv_subset2['avgestimate_log_prep'] = np.log(1 + alv_subset2.avgestimate)
for i in to_float:
alv_subset2[i] = alv_subset2[i].astype(float, errors = 'raise')
df_temp = df_temp.append(alv_subset2)
df_temp.drop(['index', 'status','realized_price', 'status', 'mean_estimated_usd', 'deviation', 'mean_estimated_usd_log', 'deviation_rel_log'], axis=1)
appended_data.append(alv_subset2)
# see pd.concat documentation for more info
df_temp = pd.concat(appended_data)发布于 2021-01-19 20:07:22
您希望通过seller进行聚合,然后使用5年滚动窗口进行聚合,使用 。
( a)首先,sale_year 使成为索引,以便rolling()可以直接处理索引;通过升序(而不是降序)日期对df进行排序也很有帮助:
df = df.set_index('sale_year').sort_index()( b)接下来,您可以使用df.rolling(window=5, min_periods=1). 这个窗口是索引使用的任何单位(这里:年份)和the min_periods=1 argument is to prevent spurious annoying NaNs in the result的'5‘。
然后,使用 , passing it a dict of 'column_name': aggregation_function_or_name计算汇总统计数据。这些函数可以是一个函数,字符串名(例如pd.Series.mean )、“意思”,或者是您自己的自定义函数或lambda函数:
df.groupby('seller').rolling(5, min_periods=1).agg({'mean_estimate': 'mean'})
mean_estimate
seller sale_year
alice 2017 60000.000000
2017 130000.000000
2019 98333.333333
bob 2015 75000.000000
2016 87500.000000
2017 65000.000000
2018 52500.000000然后将所有其他“字段”:“函数”条目添加到聚合dict中。pd.RollingGroupby没有一个与基本GroupBy和实际上 doesn't support most aggregate methods on RollingGroupby不同的size方法,所以您必须编写其中的一些代码。
发布于 2021-01-19 16:28:03
你可以试试这样的东西:
def grpdates(grp):
row_sale_year = grp.sale_year.values
row_seller = grp.seller.iloc[0]
print(row_seller, row_sale_year)
for year in row_sale_year:
df2 = (df1.loc[(df1.seller==row_seller) & (year-5<=df1.sale_year) & (df1.sale_year<=year-1)])
### apply operation for each seller and for their corresponding year
### and concatenate them horizontally with pd.concat(<>)
print(df2)
df1 = df.loc[df.status=='sold']
df1.groupby('seller').apply(grpdates)https://stackoverflow.com/questions/65794989
复制相似问题