业界使用最多的Python中Dataframe的重塑变形

Python编程爱好者

发布于 2020-11-30 10:41:27

1.9K0

发布于 2020-11-30 10:41:27

文章被收录于专栏：Python编程爱好者

作者 | 计算广告生态

目录设定

pivot
pivot_table
stack/unstack

pivot

pivot函数用于从给定的表中创建出新的派生表 pivot有三个参数:

索引
列
值


def pivot_simple(index, columns, values):
    """
    index : ndarray
        Labels to use to make new frame's index
    columns : ndarray
        Labels to use to make new frame's columns
    values : ndarray
        Values to use for populating new frame's values

pivot函数将创建一个新表，其行和列索引是相应参数的唯一值读取数据：

from collections import OrderedDict
from pandas import DataFrame
import pandas as pd
import numpy as np
 
data = OrderedDict((
    ("item", ['Item1', 'Item1', 'Item2', 'Item2']),
    ('color', ['red', 'blue', 'red', 'black']),
    ('user',  ['1', '2', '3', '4']),
    ('bm',   ['1', '2', '3', '4'])
))
data = DataFrame(data)

<=============================================
    item  color user bm
0  Item1    red    1  1
1  Item1   blue    2  2
2  Item2    red    3  3
3  Item2  black    4  4

利用pivot进行变形：

df = data.pivot(index='item', columns='color', values='user')
print(df)

<==============================================
color black  blue red
item                 
Item1  None     2   1
Item2     4  None   3

将上述数据中的bm也包含进来

df = data.pivot(index='item', columns='color')
print(df)

<==============================================
color black  blue red black  blue red
item                                 
Item1  None     2   1  None     2   1
Item2     4  None   3     4  None   3

pivot_table

先看如下例子，使用pivot实现

data = OrderedDict((
    ("item", ['Item1', 'Item1', 'Item1', 'Item2']),
    ('color', ['red', 'blue', 'red', 'black']),
    ('user',  ['1', '2', '3', '4']),
    ('bm',   ['1', '2', '3', '4'])
))
data = DataFrame(data)
 
df = data.pivot(index='item', columns='color', values='user')

<==============================================
ValueError: Index contains duplicate entries, cannot reshape

可以看到，现在index和columns对应的位置有不同的值。因此，必须确保我们指定的列和行没有重复的数据，才可以用pivot函数

pivot_table方法实现了类似pivot方法的功能它可以在指定的列和行有重复的情况下使用我们可以使用均值、中值或其他的聚合函数来计算重复条目中的单个值

先看pivot_table这个方法

pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',fill_value=None, margins=False, dropna=True,margins_name='All')

parameter	details
data	DataFrame
values	column to aggregate, optional
index	column, Grouper, array, or list of the previous.If an array is passed, it must be the same length as the data. The list.can contain any of the other types (except list)..Keys to group by on the pivot table index. If an array is passed, it.is being used as the same manner as column values.
columns	column, Grouper, array, or list of the previous.If an array is passed, it must be the same length as the data. The list.can contain any of the other types (except list)..Keys to group by on the pivot table column. If an array is passed, it.is being used as the same manner as column values.
aggfunc	function or list of functions, default numpy.mean.If list of functions passed, the resulting pivot table will have.hierarchical columns whose top level are the function names (inferred.from the function objects themselves)
fill_value	scalar, default None.Value to replace missing values with
margins	boolean, default False.Add all row / columns (e.g. for subtotal / grand totals)
dropna	boolean, default True.Do not include columns whose entries are all NaN
margins_name	string, default 'All'.Name of the row / column that will contain the totals.when margins is True.

data = OrderedDict((
    ("item", ['Item1', 'Item1', 'Item1', 'Item2']),
    ('color', ['red', 'blue', 'red', 'black']),
    ('user',  ['1', '2', '3', '4']),
    ('bm',   ['1', '2', '3', '4'])
))
data = DataFrame(data)
 
df = data.pivot_table(index='item', columns='color', values='user', aggfunc=np.min)
print(df)

<==============================================
color black  blue   red
item                   
Item1  None     2     1
Item2     4  None  None

pivot_table()是pivot()的泛化，它允许在数据集中聚合具有相同目标的多个值。

对于不用的列使用通的统计方法

使用字典来实现

df_nodmp5.pivot_table(index="ad_network_name",values=["mt_income","impression"], aggfunc={"mt_income":[np.sum],"impression":[np.sum]})

stack/unstack

事实上，变换一个表只是堆叠DataFrame的一种特殊情况假设我们有一个在行列上有多个索引的DataFrame。堆叠DataFrame意味着移动最里面的列索引成为最里面的行索引，反向操作称之为取消堆叠，意味着将最里面的行索引移动为最里面的列索引。

from pandas import DataFrame
import pandas as pd
import numpy as np
 
# 建立多个行索引
row_idx_arr = list(zip(['r0', 'r0'], ['r-00', 'r-01']))
row_idx = pd.MultiIndex.from_tuples(row_idx_arr)
print "行索引:"
print row_idx

# 建立多个列索引
col_idx_arr = list(zip(['c0', 'c0', 'c1'], ['c-00', 'c-01', 'c-10']))
col_idx = pd.MultiIndex.from_tuples(col_idx_arr)
print "列索引:"
print col_idx

# 创建DataFrame
d = DataFrame(np.arange(6).reshape(2,3), index=row_idx, columns=col_idx)
d = d.applymap(lambda x: (x // 3, x % 3))
print "dataframe:"
print d

# Stack/Unstack
s = d.stack()
u = d.unstack()
print(s)
print(u)

<=====================================================
## 行索引:
MultiIndex(levels=[[u'r0'], [u'r-00', u'r-01']],
           labels=[[0, 0], [0, 1]])

## 列索引:
MultiIndex(levels=[[u'c0', u'c1'], [u'c-00', u'c-01', u'c-10']],
           labels=[[0, 0, 1], [0, 1, 2]])

## dataframe:
             c0              c1
           c-00    c-01    c-10
r0 r-00  (0, 0)  (0, 1)  (0, 2)
   r-01  (1, 0)  (1, 1)  (1, 2)

## stack:
                  c0      c1
r0 r-00 c-00  (0, 0)     NaN
        c-01  (0, 1)     NaN
        c-10     NaN  (0, 2)
   r-01 c-00  (1, 0)     NaN
        c-01  (1, 1)     NaN
        c-10     NaN  (1, 2)

## unstack:
        c0                              c1        
      c-00            c-01            c-10        
      r-00    r-01    r-00    r-01    r-00    r-01
r0  (0, 0)  (1, 0)  (0, 1)  (1, 1)  (0, 2)  (1, 2)

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2020-11-21，如有侵权请联系 cloudcommunity@tencent.com 删除

linux