Q: Summary calculations on a Pandas DataFrame

Stack Overflow user

Asked 2013-11-01 14:03:50

Answers: 3  Views: 891  Followers: 0  Votes: 1

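I have a DF that looks like the one at the bottom (an excerpt; there are 4 regions, and the dates extend by one quarter at a time).

I want to create a df (per region) that contains only the difference between the latest date and the previous quarter, and between the latest date and the same quarter of the previous year.

At the moment, region and Quradate are both index levels.

So I want something like this (not really close):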

(['region'] ['Quradate'][-1:-1])-(['region'] ['Quradate'][-2:-2]) 
& (['region']  ['Quradate'][-1:-1])-(['region'] ['Quradate'][-5:-5])

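So I would end up with two rows per region: the first with the difference in the scores (there are actually 5 scores) from the previous quarter, and the second with the difference from the previous year.

I'm stuck.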

                                                                  Score1      Score2  
region                                           Quradate           
North_Central-Birmingham-Tuscaloosa-Anniston 2010-01-15             47           50
                                             2010-04-15             45           60
                                             2010-07-15             45           40
                                             2010-10-15             42           43
                                             2011-01-15             46           44
                                             2011-04-15             45           45
                                             2011-07-15             45           45
                                             2011-10-15             43           46
                                             2012-01-15             51           55
                                             2012-04-15             53           56
                                             2012-07-15             51           57
                                             2012-10-15             52           58
                                             2013-01-15             50           50
                                             2013-04-15             55           55
                                             2013-07-15             55           56
                                             2013-10-15             51           66   
North_Huntsville-Decatur-Florence            2010-01-15             55           55

python

pandas

3 Answers

Stack Overflow user

Accepted answer

Posted 2013-11-03 14:29:44

For a solution and discussion, see: Selecting a new dataframe via a multi-indexed frame in Pandas using index names

Basically, what you need is the difference from the previous period:

df.groupby(level='region').apply(lambda x: x.diff().iloc[-1])

And the difference from a year ago (4 quarters):

df.groupby(level='region').apply(lambda x: x.diff(4).iloc[-1])
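As a rough sketch of how the two one-liners might be combined into the two-rows-per-region frame the question asks for: the data below is a small hypothetical sample (the North_Central rows reuse the last five quarters from the question's table, the North_Huntsville values and the shortened region names are made up), and `last_diff` and the `id` labels are names introduced here for illustration only.

```python
import pandas as pd

# Hypothetical sample: two regions, five quarters each, indexed by (region, Quradate).
dates = pd.to_datetime(['2012-10-15', '2013-01-15', '2013-04-15',
                        '2013-07-15', '2013-10-15'])
index = pd.MultiIndex.from_product(
    [['North_Central', 'North_Huntsville'], dates],
    names=['region', 'Quradate'])
df = pd.DataFrame({'Score1': [52, 50, 55, 55, 51, 40, 41, 42, 43, 45],
                   'Score2': [58, 50, 55, 56, 66, 30, 32, 34, 36, 39]},
                  index=index)

def last_diff(frame, periods):
    # For each region: latest row minus the row `periods` quarters earlier.
    return frame.groupby(level='region').apply(
        lambda x: x.diff(periods).iloc[-1])

# Stack the quarter and year differences, then order the rows by region.
result = pd.concat({'quarter_diff': last_diff(df, 1),
                    'year_diff': last_diff(df, 4)},
                   names=['id', 'region']).swaplevel().sort_index()
print(result)
```

This relies on each region's rows being in date order with no missing quarters, since `diff` works on row positions rather than on the dates themselves.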

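Votes: 1

Stack Overflow user

Posted 2013-11-01 15:48:02

I think you are more or less on the right track. I would write a function that calculates the two values you are looking for and returns a dataframe. Something like this: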

import pandas as pd

def find_diffs(region):
    # Assumes Quradate is a datetime column, not an index level
    score_cols = ['Score1', 'Score2']
    region = region.sort_values('Quradate')

    most_recent_date = region.Quradate.max()
    last_quarter = most_recent_date - pd.DateOffset(months=3)  # shift back one quarter (3 months)
    last_year = most_recent_date - pd.DateOffset(years=1)  # shift back one year

    # diff() over the two selected rows leaves NaN in the first row, so keep only the last
    quarter_rows = region[(region.Quradate == most_recent_date) | (region.Quradate == last_quarter)]
    quarter_score_diff = quarter_rows[score_cols].diff().iloc[[-1]].assign(id='quarter_diff')

    year_rows = region[(region.Quradate == most_recent_date) | (region.Quradate == last_year)]
    year_score_diff = year_rows[score_cols].diff().iloc[[-1]].assign(id='year_diff')

    df_temp = pd.concat([quarter_score_diff, year_score_diff])
    return df_temp

Then you can do:

DF.reset_index().groupby('region').apply(find_diffs)

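The result will be a DF indexed by region, with a column for each score difference plus an extra column that identifies each row as a quarterly or yearly difference.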
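One detail worth checking when rows are matched by exact date is how the "one quarter back" date is computed. The sketch below uses the latest date from the question's table to compare a fixed number of days against a calendar-aware `pd.DateOffset`; the variable names are illustrative only.

```python
import datetime
import pandas as pd

most_recent_date = pd.Timestamp('2013-10-15')

# A fixed span of 365/4 = 91.25 days does not land on the 15th of the month.
by_days = most_recent_date - datetime.timedelta(365 / 4)

# A calendar offset steps back exactly three months, keeping the day of month.
by_offset = most_recent_date - pd.DateOffset(months=3)

print(by_days)    # 2013-07-15 18:00:00
print(by_offset)  # 2013-07-15 00:00:00
```

Because an equality test such as `Quradate == last_quarter` needs an exact match, only the calendar-offset version would find the 2013-07-15 row here.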

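Votes: 0

Stack Overflow user

Posted 2013-11-03 09:03:31

Writing a function to use with groupby is certainly one option. Another easy approach is to make lists of the data in each group and use indices for the calculation. This works because the data is regularly spaced (and keep in mind that it only works when the data is regularly spaced). This approach avoids really having to deal with dates. First I would reindex so that the region appears as a column in the dataframe, then do the following: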

# First I create some data
import numpy as np
import pandas as pd

# 'Q' is the quarter-end frequency alias; pandas 2.2+ spells it 'QE'
Dates = pd.date_range('2010-1-1', periods=14, freq='Q')
Regions = ['Western', 'Eastern', 'Southern', 'Northern']
df = pd.DataFrame({'Regions': [elem for elem in Regions for x in range(14)],
                   'Score1': np.random.rand(56), 'Score2': np.random.rand(56), 'Score3': np.random.rand(56),
                   'Score4': np.random.rand(56), 'Score5': np.random.rand(56)}, index=list(Dates) * 4)

# Create a dictionary to hold your data
SCORES = ['Score1', 'Score2', 'Score3', 'Score4', 'Score5']
ValuesDict = {region : {score : [int(), int()] for score in SCORES} for region in df.Regions.unique()}

# This dictionary's keys are your regions. Each region maps to a dictionary
# whose keys are your scores, and each score maps to a two-element list:
# the first element is the (most recent - last quarter) calculation,
# and the second is the (most recent - last year) calculation.

#Now group the data
dfGrouped = df.groupby('Regions')

# Now iterate through the groups, creating lists of the underlying data.
# The value at the last position of each list is the newest observation,
# provided the rows are already in date order (groupby keeps the original
# row order within each group). The observation one year earlier is
# 4 positions before it, i.e. at index -5.

for group in dfGrouped:
    Score1List = list(group[1].Score1)
    Score2List = list(group[1].Score2)
    Score3List = list(group[1].Score3)
    Score4List = list(group[1].Score4)
    Score5List = list(group[1].Score5)
    MasterList = [Score1List, Score2List, Score3List, Score4List, Score5List]
    for x in range(1, 6):
        ValuesDict[group[0]]['Score' + str(x)][0] = MasterList[x-1][-1] - MasterList[x-1][-2]
        ValuesDict[group[0]]['Score' + str(x)][1] = MasterList[x-1][-1] - MasterList[x-1][-5]

ValuesDict
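The same positional idea can be sketched without building intermediate lists, by subtracting rows selected with `iloc` inside each group. This is only an illustration under the same assumption of regularly spaced, date-ordered rows; the data is a small hypothetical sample with two scores rather than five.

```python
import pandas as pd

# Hypothetical, deterministic data: 2 regions x 6 quarters, already in date order.
dates = pd.to_datetime(['2012-03-31', '2012-06-30', '2012-09-30',
                        '2012-12-31', '2013-03-31', '2013-06-30'])
df = pd.DataFrame({'Regions': ['Eastern'] * 6 + ['Western'] * 6,
                   'Score1': [1, 2, 4, 7, 11, 16, 10, 10, 10, 10, 10, 13],
                   'Score2': [5, 5, 6, 6, 7, 9, 20, 18, 16, 14, 12, 10]},
                  index=list(dates) * 2)

grouped = df.groupby('Regions')[['Score1', 'Score2']]

# Latest row minus the previous quarter (1 position back).
quarter_diff = grouped.apply(lambda g: g.iloc[-1] - g.iloc[-2])
# Latest row minus the same quarter a year earlier (4 positions back).
year_diff = grouped.apply(lambda g: g.iloc[-1] - g.iloc[-5])

print(quarter_diff)
print(year_diff)
```

Each result is a frame indexed by region with one column per score, so it scales to any number of score columns without a hand-written list for each one.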