你好,数据科学家和熊猫专家们,
我需要一些帮助来弄清楚如何在应用groupby聚合方法后更好地组织我的数据。我尝试过对新的数据帧进行拆分,但没有产生预期的结果。
这是我的数据框:
df = [{'Store': 's1', 'Date': Timestamp('2020-08-01 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's1ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-02 00:00:00'), 'Employee': 'a', 'Department': 'd2', 'ID': 's1ad2', 'Level': 2, 'duties': 'C'},\
{'Store': 's1', 'Date': Timestamp('2020-08-03 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's1ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-04 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's1ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-05 00:00:00'), 'Employee': 'a', 'Department': 'd2', 'ID': 's1ad2', 'Level': 2, 'duties': 'C'},\
{'Store': 's2', 'Date': Timestamp('2020-08-08 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's2ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-09 00:00:00'), 'Employee': 'a', 'Department': 'd3', 'ID': 's2ad3', 'Level': 2, 'duties': 'C'},\
{'Store': 's2', 'Date': Timestamp('2020-08-10 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's2ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-11 00:00:00'), 'Employee': 'a', 'Department': 'd1', 'ID': 's2ad1', 'Level': 2, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-12 00:00:00'), 'Employee': 'a', 'Department': 'd3', 'ID': 's3ad1', 'Level': 2, 'duties': 'C'},\
{'Store': 's1', 'Date': Timestamp('2020-08-01 00:00:00'), 'Employee': 'b', 'Department': 'd1', 'ID': 's1bd1', 'Level': 1, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-02 00:00:00'), 'Employee': 'b', 'Department': 'd2', 'ID': 's1bd2', 'Level': 1, 'duties': 'C'},\
{'Store': 's1', 'Date': Timestamp('2020-08-03 00:00:00'), 'Employee': 'b', 'Department': 'd1', 'ID': 's1bd1', 'Level': 1, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-04 00:00:00'), 'Employee': 'b', 'Department': 'd1', 'ID': 's1bd1', 'Level': 1, 'duties': 'O'},\
{'Store': 's1', 'Date': Timestamp('2020-08-05 00:00:00'), 'Employee': 'b', 'Department': 'd2', 'ID': 's1bd2', 'Level': 1, 'duties': 'C'},\
{'Store': 's2', 'Date': Timestamp('2020-08-08 00:00:00'), 'Employee': 'c', 'Department': 'd1', 'ID': 's2ac1', 'Level': 3, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-09 00:00:00'), 'Employee': 'c', 'Department': 'd3', 'ID': 's2cd3', 'Level': 3, 'duties': 'C'},\
{'Store': 's2', 'Date': Timestamp('2020-08-10 00:00:00'), 'Employee': 'c', 'Department': 'd1', 'ID': 's2cd1', 'Level': 3, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-11 00:00:00'), 'Employee': 'c', 'Department': 'd1', 'ID': 's2cd1', 'Level': 3, 'duties': 'O'},\
{'Store': 's2', 'Date': Timestamp('2020-08-12 00:00:00'), 'Employee': 'c', 'Department': 'd3', 'ID': 's3cd1', 'Level': 3, 'duties': 'C'},\
{'Store': 's3', 'Date': Timestamp('2020-08-08 00:00:00'), 'Employee': 'd', 'Department': 'd1', 'ID': 's3cd1', 'Level': 3, 'duties': 'O'},\
{'Store': 's3', 'Date': Timestamp('2020-08-09 00:00:00'), 'Employee': 'd', 'Department': 'd3', 'ID': 's3dd3', 'Level': 3, 'duties': 'C'},\
{'Store': 's3', 'Date': Timestamp('2020-08-10 00:00:00'), 'Employee': 'd', 'Department': 'd1', 'ID': 's3dd1', 'Level': 3, 'duties': 'O'},\
{'Store': 's3', 'Date': Timestamp('2020-08-11 00:00:00'), 'Employee': 'd','Department': 'd1', 'ID': 's3dd1', 'Level': 3, 'duties': 'O'},\
{'Store': 's3', 'Date': Timestamp('2020-08-12 00:00:00'), 'Employee': 'd', 'Department': 'd3', 'ID': 's3dd1', 'Level': 3, 'duties': 'C'}]我想组织我的输出,这样它就可以很好地堆叠在Store --> Department --> Employee中。如下所示(抱歉,输出行不是很好):
Store s1 s2 s3
Department d1 d2 d1 d3 d1 d3
Employee ID Level duties first shift last shift #ofshift first shift last shift #ofshift first shift last shift #ofshift first shift last shift #ofshift first shift last shift #ofshift first shift last shift #ofshift
a s1ad1 2 O 2020-08-01 2020-08-04 3
a s1ad2 2 C 2020-08-02 2020-08-05 2
a s2ad1 2 O 2020-08-08 2020-08-11 3
a s2ad3 2 O 2020-08-09 2020-08-12 2
b s1bd1 1 O 2020-08-01 2020-08-04 3
b s1bd2 1 C 2020-08-02 2020-08-05 2
c s2cd1 3 O 2020-08-08 2020-08-11 3
c s3ad3 3 O 2020-08-09 2020-08-12 2
d s3dd1 3 O 2020-08-08 2020-08-11 3
d s3dd3 3 O 2020-08-09 2020-08-12 2因此,我尝试使用Group by表达式:
df = df.groupby(['Employee', 'Store', 'Department'])\
.agg({'Date':['first', 'last', 'size'],
'ID': 'first',
'Level': 'first',
'duties': 'first'})
# Join the Each Column with its operation.
df.columns = df.columns.map('_'.join)
# Reset the Index
df = df.reset_index().set_index('Employee')
# Renaming Columns of Dataframe.
df.rename(columns={'Date_first':'First Shift',
'Date_last':'Last Shift',
'Date_size':'# of shift',
'ID_first':'ID',
'Level_first':'Level',
'duties_first':'duties'},
inplace=True)这将打印以下结果:
Store Department First Shift Last Shift # of shift ID_first Level duties
Employee
a s1 d1 2020-08-01 2020-08-04 3 s1ad1 2 O
a s1 d2 2020-08-02 2020-08-05 2 s1ad2 2 C
a s2 d1 2020-08-08 2020-08-11 3 s2ad1 2 O
a s2 d3 2020-08-09 2020-08-12 2 s2ad3 2 C
b s1 d1 2020-08-01 2020-08-04 3 s1bd1 1 O
b s1 d2 2020-08-02 2020-08-05 2 s1bd2 1 C
c s2 d1 2020-08-08 2020-08-11 3 s2ac1 3 O
c s2 d3 2020-08-09 2020-08-12 2 s2cd3 3 C
d s3 d1 2020-08-08 2020-08-11 3 s3cd1 3 O
d s3 d3 2020-08-09 2020-08-12 2 s3dd3 3 C然后我应用了unstack表达式,如下所示:
df = df.groupby(['Employee', 'Store', 'Department', 'First Shift', 'Last Shift', '# of shift', 'ID_first',\
'Level', 'duties'])\
.size()\
.unstack(['Store', 'Department']).fillna(0)它按如下方式输出结果:
Store s1 s2 s3
Department d1 d2 d1 d3 d1 d3
Employee First Shift Last Shift # of shift ID_first Level duties
a 2020-08-01 2020-08-04 3 s1ad1 2 O 1.0 0.0 0.0 0.0 0.0 0.0
2020-08-02 2020-08-05 2 s1ad2 2 C 0.0 1.0 0.0 0.0 0.0 0.0
2020-08-08 2020-08-11 3 s2ad1 2 O 0.0 0.0 1.0 0.0 0.0 0.0
2020-08-09 2020-08-12 2 s2ad3 2 C 0.0 0.0 0.0 1.0 0.0 0.0
b 2020-08-01 2020-08-04 3 s1bd1 1 O 1.0 0.0 0.0 0.0 0.0 0.0
2020-08-02 2020-08-05 2 s1bd2 1 C 0.0 1.0 0.0 0.0 0.0 0.0
c 2020-08-08 2020-08-11 3 s2ac1 3 O 0.0 0.0 1.0 0.0 0.0 0.0
2020-08-09 2020-08-12 2 s2cd3 3 C 0.0 0.0 0.0 1.0 0.0 0.0
d 2020-08-08 2020-08-11 3 s3cd1 3 O 0.0 0.0 0.0 0.0 1.0 0.0
2020-08-09 2020-08-12 2 s3dd3 3 C 0.0 0.0 0.0 0.0 0.0 1.0我想我错误地使用了Size和unstack。然而,我似乎想不出如何重新组织数据。
我非常感谢专家对如何正确组织我的数据的意见。
再次感谢您的帮助和体贴。
谢谢。
发布于 2020-10-24 09:01:06
问题的症结在于,您需要在使用.unstack()之前重新构造数据,因为您需要的格式是一个矩阵,其值是三个重复的列。因此,您需要将数据帧从wide更改为long,并在一个列Values中创建一个包含这三个值的新列,另一个列将它们分类为Shift。
# Step 1: Named Groupby Agregation naming columns ins specific format required for `pd.wide_to_long`. Must end with integer.
df = (df.groupby(['Employee', 'Store', 'Department'])
.agg(Shift_1=('Date','first'),
Shift_2=('Date','last'),
Shift_3=('Date','size'),
ID=('ID', 'first'),
Level=('Level', 'first'),
duties=('duties', 'first'))
.reset_index())
# Step 2: In preparation for a matrix The data must be transformed so that the three columns that are values in the matrix must be in long format.
df = pd.wide_to_long(df, stubnames='Shift_', i='ID', j='Shift').reset_index().rename(columns={'Shift_':'Values'})
# Step 3: 1,2,3 were required integer suffixes for wide_to_long but now let's change to what we want the columns to be called.
df['Shift'] = df['Shift'].replace([1,2,3],['First Shift','Last Shift','# of shift'])
# Step 4: Create the matrix be setting index and unstacking to columns
df = (df.sort_values(['Employee', 'Store', 'Department']) #values must be sorted in order for how we want columns to appear in matrix format
.set_index(['Employee', 'ID', 'Level', 'duties', 'Store', 'Department', 'Shift'])
.unstack(['Store', 'Department', 'Shift']).fillna(0))
# Step 5: Cleanup of Multi-index into desred format
df.columns = df.columns.reorder_levels([1,2,3,0]).droplevel(3)
df = df.reset_index()
df
Out[1]:
Store Employee ID Level duties s1 \
Department d1
Shift First Shift
0 a s1ad1 2 O 2020-08-01 00:00:00
1 a s1ad2 2 C 0
2 a s2ad1 2 O 0
3 a s2ad3 2 C 0
4 b s1bd1 1 O 2020-08-01 00:00:00
5 b s1bd2 1 C 0
6 c s2ac1 3 O 0
7 c s2cd3 3 C 0
8 d s3cd1 3 O 0
9 d s3dd3 3 C 0
Store \
Department d2
Shift Last Shift # of shift First Shift
0 2020-08-04 00:00:00 3 0
1 0 0 2020-08-02 00:00:00
2 0 0 0
3 0 0 0
4 2020-08-04 00:00:00 3 0
5 0 0 2020-08-02 00:00:00
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Store ... s2 \
Department ... d1
Shift Last Shift # of shift ... # of shift
0 0 0 ... 0
1 2020-08-05 00:00:00 2 ... 0
2 0 0 ... 3
3 0 0 ... 0
4 0 0 ... 0
5 2020-08-05 00:00:00 2 ... 0
6 0 0 ... 3
7 0 0 ... 0
8 0 0 ... 0
9 0 0 ... 0
Store \
Department d3
Shift First Shift Last Shift # of shift
0 0 0 0
1 0 0 0
2 0 0 0
3 2020-08-09 00:00:00 2020-08-12 00:00:00 2
4 0 0 0
5 0 0 0
6 0 0 0
7 2020-08-09 00:00:00 2020-08-12 00:00:00 2
8 0 0 0
9 0 0 0
Store s3 \
Department d1
Shift First Shift Last Shift # of shift
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 2020-08-08 00:00:00 2020-08-11 00:00:00 3
9 0 0 0
Store
Department d3
Shift First Shift Last Shift # of shift
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 2020-08-09 00:00:00 2020-08-12 00:00:00 2
[10 rows x 22 columns]https://stackoverflow.com/questions/64508095
复制相似问题