作者 | 计算广告生态
pivot函数用于从给定的表中创建出新的派生表 pivot有三个参数:
def pivot_simple(index, columns, values):
"""
index : ndarray
Labels to use to make new frame's index
columns : ndarray
Labels to use to make new frame's columns
values : ndarray
Values to use for populating new frame's values
pivot函数将创建一个新表,其行和列索引是相应参数的唯一值 读取数据:
from collections import OrderedDict
from pandas import DataFrame
import pandas as pd
import numpy as np
data = OrderedDict((
("item", ['Item1', 'Item1', 'Item2', 'Item2']),
('color', ['red', 'blue', 'red', 'black']),
('user', ['1', '2', '3', '4']),
('bm', ['1', '2', '3', '4'])
))
data = DataFrame(data)
<=============================================
item color user bm
0 Item1 red 1 1
1 Item1 blue 2 2
2 Item2 red 3 3
3 Item2 black 4 4
利用pivot进行变形:
df = data.pivot(index='item', columns='color', values='user')
print(df)
<==============================================
color black blue red
item
Item1 None 2 1
Item2 4 None 3
将上述数据中的bm也包含进来
df = data.pivot(index='item', columns='color')
print(df)
<==============================================
color black blue red black blue red
item
Item1 None 2 1 None 2 1
Item2 4 None 3 4 None 3
先看如下例子,使用pivot实现
data = OrderedDict((
("item", ['Item1', 'Item1', 'Item1', 'Item2']),
('color', ['red', 'blue', 'red', 'black']),
('user', ['1', '2', '3', '4']),
('bm', ['1', '2', '3', '4'])
))
data = DataFrame(data)
df = data.pivot(index='item', columns='color', values='user')
<==============================================
ValueError: Index contains duplicate entries, cannot reshape
可以看到,现在index和columns对应的位置有不同的值。因此,必须确保我们指定的列和行没有重复的数据,才可以用pivot函数
pivot_table方法实现了类似pivot方法的功能 它可以在指定的列和行有重复的情况下使用 我们可以使用均值、中值或其他的聚合函数来计算重复条目中的单个值
先看pivot_table这个方法
pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',fill_value=None, margins=False, dropna=True,margins_name='All')
parameter | details |
---|---|
data | DataFrame |
values | column to aggregate, optional |
index | column, Grouper, array, or list of the previous.If an array is passed, it must be the same length as the data. The list.can contain any of the other types (except list)..Keys to group by on the pivot table index. If an array is passed, it.is being used as the same manner as column values. |
columns | column, Grouper, array, or list of the previous.If an array is passed, it must be the same length as the data. The list.can contain any of the other types (except list)..Keys to group by on the pivot table column. If an array is passed, it.is being used as the same manner as column values. |
aggfunc | function or list of functions, default numpy.mean.If list of functions passed, the resulting pivot table will have.hierarchical columns whose top level are the function names (inferred.from the function objects themselves) |
fill_value | scalar, default None.Value to replace missing values with |
margins | boolean, default False.Add all row / columns (e.g. for subtotal / grand totals) |
dropna | boolean, default True.Do not include columns whose entries are all NaN |
margins_name | string, default 'All'.Name of the row / column that will contain the totals.when margins is True. |
data = OrderedDict((
("item", ['Item1', 'Item1', 'Item1', 'Item2']),
('color', ['red', 'blue', 'red', 'black']),
('user', ['1', '2', '3', '4']),
('bm', ['1', '2', '3', '4'])
))
data = DataFrame(data)
df = data.pivot_table(index='item', columns='color', values='user', aggfunc=np.min)
print(df)
<==============================================
color black blue red
item
Item1 None 2 1
Item2 4 None None
pivot_table()是pivot()的泛化,它允许在数据集中聚合具有相同目标的多个值。
使用字典来实现
df_nodmp5.pivot_table(index="ad_network_name",values=["mt_income","impression"], aggfunc={"mt_income":[np.sum],"impression":[np.sum]})
事实上,变换一个表只是堆叠DataFrame的一种特殊情况 假设我们有一个在行列上有多个索引的DataFrame。堆叠DataFrame意味着移动最里面的列索引成为最里面的行索引,反向操作称之为取消堆叠,意味着将最里面的行索引移动为最里面的列索引。
from pandas import DataFrame
import pandas as pd
import numpy as np
# 建立多个行索引
row_idx_arr = list(zip(['r0', 'r0'], ['r-00', 'r-01']))
row_idx = pd.MultiIndex.from_tuples(row_idx_arr)
print "行索引:"
print row_idx
# 建立多个列索引
col_idx_arr = list(zip(['c0', 'c0', 'c1'], ['c-00', 'c-01', 'c-10']))
col_idx = pd.MultiIndex.from_tuples(col_idx_arr)
print "列索引:"
print col_idx
# 创建DataFrame
d = DataFrame(np.arange(6).reshape(2,3), index=row_idx, columns=col_idx)
d = d.applymap(lambda x: (x // 3, x % 3))
print "dataframe:"
print d
# Stack/Unstack
s = d.stack()
u = d.unstack()
print(s)
print(u)
<=====================================================
## 行索引:
MultiIndex(levels=[[u'r0'], [u'r-00', u'r-01']],
labels=[[0, 0], [0, 1]])
## 列索引:
MultiIndex(levels=[[u'c0', u'c1'], [u'c-00', u'c-01', u'c-10']],
labels=[[0, 0, 1], [0, 1, 2]])
## dataframe:
c0 c1
c-00 c-01 c-10
r0 r-00 (0, 0) (0, 1) (0, 2)
r-01 (1, 0) (1, 1) (1, 2)
## stack:
c0 c1
r0 r-00 c-00 (0, 0) NaN
c-01 (0, 1) NaN
c-10 NaN (0, 2)
r-01 c-00 (1, 0) NaN
c-01 (1, 1) NaN
c-10 NaN (1, 2)
## unstack:
c0 c1
c-00 c-01 c-10
r-00 r-01 r-00 r-01 r-00 r-01
r0 (0, 0) (1, 0) (0, 1) (1, 1) (0, 2) (1, 2)