pandas is built on top of NumPy and makes processing, analyzing, and visualizing data much simpler. As the official pandas website puts it:
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
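The two main data structures in pandas are Series and DataFrame. A Series is a one-dimensional array-like object, similar to a NumPy array, that consists of a sequence of values and an associated set of labels (its index).
Before using Series you need to install the pandas library, which can be done with pip.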
pip install pandas
Then import it in Python.
In [3]: from pandas import (Series,DataFrame)
In [4]: import pandas as pd
In [5]: test = Series([1,2,3,-6])
In [6]: test
Out[6]:
0 1
1 2
2 3
3 -6
dtype: int64
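The left column is the index and the right column holds the values; dtype is the storage type of the data. As shown above, if no index is supplied, Series automatically creates an integer index running from 0 to N-1. To get the values and the index of a Series: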
In [7]: test.values
Out[7]: array([ 1, 2, 3, -6])
In [8]: test.index
Out[8]: RangeIndex(start=0, stop=4, step=1)
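Here are the parameters of the Series constructor and their descriptions, as printed by the pandas version used in this session (newer versions differ slightly):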
Init signature: Series(self, data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Parameters
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be unique and hashable, same length as data. Index object (or other iterable of same length as data). Will default to RangeIndex(len(data)) if not provided. If both a dict and index sequence are used, the index will override the keys found in the dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
File: /usr/local/lib/python2.7/site-packages/pandas/core/series.py
Type: type
In [19]: test = Series([-2,3,4,-10],index=['a','b','c','d'])
In [20]: test[['a','c']]
Out[20]:
a -2
c 4
dtype: int64
In [21]: test[test>0]
Out[21]:
b 3
c 4
dtype: int64
In [22]: test*2
Out[22]:
a -4
b 6
c 8
d -20
dtype: int64
In [23]: 'a' in test
Out[23]: True
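The session above can be collected into a small standalone script. This is only a sketch: the variable names (`s`, `subset`, `positive`, and so on) are illustrative and not part of pandas. It also shows a point that is easy to miss: `in` tests the index labels of a Series, not its values.

```python
import pandas as pd

# A small labelled Series; labels and values mirror the session above.
s = pd.Series([-2, 3, 4, -10], index=["a", "b", "c", "d"])

subset = s[["a", "c"]]     # select several labels at once
positive = s[s > 0]        # a boolean mask keeps only the positive values
doubled = s * 2            # arithmetic is applied element by element

has_label_a = "a" in s     # True: `in` looks at the index labels
has_label_3 = 3 in s       # False: 3 is a value, not a label
has_value_3 = 3 in s.values  # True: test the values explicitly

print(subset)
print(positive)
print(doubled)
```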
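A Series can also be thought of as a dict: the keys map to the index and the values map to the data. When an arithmetic operation combines two Series, pandas automatically aligns them by index label; a label that is missing from either operand produces NaN in the result.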
In [29]: d = {'name':'brian','age':12}
In [30]: test = Series(d)
In [31]: test
Out[31]:
age 12
name brian
dtype: object
In [32]: c = {'a':12,'b':23,'c':24}
In [33]: test1 = Series(c)
In [34]: test1
Out[34]:
a 12
b 23
c 24
dtype: int64
In [35]: d=['a','c','b','d']
In [36]: test2 = Series(c,index=d)
In [37]: test2
Out[37]:
a 12.0
c 24.0
b 23.0
d NaN
dtype: float64
In [51]: test1+test2
Out[51]:
a 24.0
b 46.0
c 48.0
d NaN
dtype: float64
In [56]: test1.index=['k','e','g']
In [57]: test1
Out[57]:
k 12
e 23
g 24
dtype: int64
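Index alignment can be seen more directly when the two operands only partly overlap. The following is a minimal sketch with made-up data (`left` and `right` are illustrative names); it also uses `Series.add` with `fill_value`, which is one way to treat a label missing on one side as 0 instead of getting NaN.

```python
import pandas as pd

# Two Series whose labels overlap only on 'b' and 'c' (illustrative data).
left = pd.Series({"a": 12, "b": 23, "c": 24})
right = pd.Series({"b": 1, "c": 2, "d": 3})

# Plain + aligns on labels; labels found on only one side give NaN.
total = left + right

# add() with fill_value substitutes 0 for a label missing on one side.
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```

The result index is the union of both indexes, and the dtype becomes float64 because NaN cannot be stored in an integer Series.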
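Data processing often involves handling missing (NA) values. pandas provides isnull and notnull to detect them, and fillna to replace NaN values. fillna also accepts a method parameter that supports several fill strategies to suit your use case; it is covered in detail later.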
In [40]: pd.isnull(test2)
Out[40]:
a False
c False
b False
d True
dtype: bool
In [41]: pd.notnull(test2)
Out[41]:
a True
c True
b True
d False
dtype: bool
In [43]: test2.fillna(22)
Out[43]:
a 12.0
c 24.0
b 23.0
d 22.0
dtype: float64
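A few common ways of handling missing values are sketched below on made-up data. The variable names are illustrative; `isna` is the method form of the `pd.isnull` check shown above, and `ffill` is used here as one example of a fill strategy.

```python
import numpy as np
import pandas as pd

# Illustrative Series with two missing entries.
s = pd.Series([12.0, np.nan, 23.0, np.nan], index=["a", "b", "c", "d"])

mask = s.isna()               # True where a value is missing
constant = s.fillna(0)        # replace every NaN with a constant
carried = s.ffill()           # carry the last valid value forward
by_mean = s.fillna(s.mean())  # fill with the mean of the valid values

print(mask)
print(constant)
print(carried)
print(by_mean)
```

None of these calls modifies `s`; each returns a new Series.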
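A DataFrame is a tabular data structure; think of it as a two-dimensional structure of rows and columns that supports both row-oriented and column-oriented operations. Because it is two-dimensional, a DataFrame has columns (column labels) in addition to an index (row labels).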
In [60]: data = {'state':['oh','oh','oh','ne','ne'],'year':[2000,2001,2002,2001,2002],'pop':[1.5,1.7,3.6,2.4,2.9]}
In [61]: frame = DataFrame(data)
In [62]: frame
Out[62]:
pop state year
0 1.5 oh 2000
1 1.7 oh 2001
2 3.6 oh 2002
3 2.4 ne 2001
4 2.9 ne 2002
You can also specify the order of the columns. If a column you name does not exist in the data, it is filled with NaN by default.
In [64]: frame = DataFrame(data,columns=['year','state','pop','exit'],index=['a','b','c','d','k'])
In [65]: frame
Out[65]:
year state pop exit
a 2000 oh 1.5 NaN
b 2001 oh 1.7 NaN
c 2002 oh 3.6 NaN
d 2001 ne 2.4 NaN
k 2002 ne 2.9 NaN
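As a compact sketch of the construction above, the script below builds a smaller frame (the data is shortened for illustration) and checks what the constructor produced: the requested column order, the custom row labels, and an all-NaN column for the name that is not in the data.

```python
import pandas as pd

# Shortened version of the data used above (illustrative).
data = {
    "state": ["oh", "oh", "ne"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}

# 'exit' is not a key of data, so that column is created and filled with NaN.
frame = pd.DataFrame(
    data,
    columns=["year", "state", "pop", "exit"],
    index=["a", "b", "c"],
)

print(frame)
print(frame.columns.tolist())  # column labels in the requested order
print(frame.index.tolist())    # row labels
```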
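DataFrame indexing is column-oriented by default: frame['year'] returns a single column, which you can treat as a Series. Rows are selected by label through an indexer. The .ix indexer found in older tutorials was removed in pandas 1.0; .loc is its label-based replacement.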
In [71]: frame['year']
Out[71]:
a 2000
b 2001
c 2002
d 2001
k 2002
Name: year, dtype: int64
In [72]: frame.loc['a']
Out[72]:
year 2000
state oh
pop 1.5
exit NaN
Name: a, dtype: object
In [75]: frame.loc[['a','b'],['year','pop']]
Out[75]:
year pop
a 2000 1.5
b 2001 1.7
In [77]: frame['exit']=16.5
In [78]: frame
Out[78]:
year state pop exit
a 2000 oh 1.5 16.5
b 2001 oh 1.7 16.5
c 2002 oh 3.6 16.5
d 2001 ne 2.4 16.5
k 2002 ne 2.9 16.5
In [79]: ser = Series([12,23,4,4,5],index=['a','b','c','d','k'])
In [80]: frame['exit']=ser
In [81]: frame
Out[81]:
year state pop exit
a 2000 oh 1.5 12
b 2001 oh 1.7 23
c 2002 oh 3.6 4
d 2001 ne 2.4 4
k 2002 ne 2.9 5
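The selection and assignment steps above can be sketched as follows, on a smaller illustrative frame. The example uses `.loc` (label-based) and `.iloc` (position-based), and it shows that assigning a Series to a column aligns on the index, so a row label missing from the Series ends up as NaN.

```python
import pandas as pd

# Small illustrative frame with string row labels.
frame = pd.DataFrame(
    {"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
    index=["a", "b", "c"],
)

row_by_label = frame.loc["a"]      # row selected by its index label
row_by_position = frame.iloc[0]    # the same row selected by position
block = frame.loc[["a", "b"], ["year", "pop"]]  # rows and columns by label

# The Series has no entry for 'b', so frame.loc['b', 'exit'] becomes NaN.
frame["exit"] = pd.Series([12, 4], index=["a", "c"])

print(row_by_label)
print(block)
print(frame)
```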
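1. reindex creates a new object conformed to a new index. Labels that are absent from the original index can be given a value with fill_value. When the index is ordered, the method parameter selects how the gaps are filled: "ffill" (alias "pad") carries the previous valid value forward, and "bfill" (alias "backfill") carries the next valid value backward.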
In [11]: test = Series([1,2,3,4,5],index=['e','b','c','d','a'])
In [12]: test
Out[12]:
e 1
b 2
c 3
d 4
a 5
dtype: int64
In [13]: test1 = test.reindex(['a','b','k','g','h'],fill_value=0)
In [14]: test1
Out[14]:
a 5
b 2
k 0
g 0
h 0
dtype: int64
In [21]: test3 = Series([1,2,3,4,5],index=[2,4,7,8,9])
In [23]: test3.reindex(range(10),method='bfill')
Out[23]:
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 3
8 4
9 5
dtype: int64
In [35]: import numpy as np
In [36]: test = DataFrame(np.arange(16).reshape((4,4)),index=['a','b','c','d'],columns=['oh','te','ca','ka'])
In [37]: test
Out[37]:
oh te ca ka
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
In [38]: test1 = test.reindex(columns=['oh','ca','aa','ss'],fill_value=12)
In [39]: test1
Out[39]:
oh ca aa ss
a 0 2 12 12
b 4 6 12 12
c 8 10 12 12
d 12 14 12 12
In [40]: test1 = test.reindex(['d','a','b','c'])
In [41]: test1
Out[41]:
oh te ca ka
d 12 13 14 15
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
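To make the difference between the fill strategies concrete, here is a minimal sketch that reindexes the same Series three ways. The data and variable names are illustrative.

```python
import pandas as pd

# Values exist only at positions 0, 2 and 4 (illustrative data).
s = pd.Series([1, 2, 3], index=[0, 2, 4])

constant = s.reindex(range(6), fill_value=0)     # new labels get 0
forward = s.reindex(range(6), method="ffill")    # take the previous valid value
backward = s.reindex(range(6), method="bfill")   # take the next valid value

print(constant.tolist())  # [1, 0, 2, 0, 3, 0]
print(forward.tolist())   # [1, 1, 2, 2, 3, 3]
print(backward)           # label 5 has no later value, so it stays NaN
```

With `bfill`, the last label has nothing after it to borrow from, so it remains NaN and the result is upcast to float64.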
2. The drop family of methods and other common operations
In [42]: test = Series([1,1,2,np.nan],index=list("abcd"))
In [43]: test
Out[43]:
a 1.0
b 1.0
c 2.0
d NaN
dtype: float64
In [44]: test.drop('a')
Out[44]:
b 1.0
c 2.0
d NaN
dtype: float64
In [36]: test.dropna()
Out[36]:
a 1.0
b 1.0
c 2.0
dtype: float64
In [37]: test>5
Out[37]:
a False
b False
c False
d False
dtype: bool
In [38]: test>1
Out[38]:
a False
b False
c True
d False
dtype: bool
In [39]: test[test>1]
Out[39]:
c 2.0
dtype: float64
In [40]: test+2
Out[40]:
a 3.0
b 3.0
c 4.0
d NaN
dtype: float64
In [55]: test = DataFrame({"key":[1,1,2,np.nan],"value":[2,1,2,4]})
In [56]: test
Out[56]:
key value
0 1.0 2
1 1.0 1
2 2.0 2
3 NaN 4
In [61]: test.drop(0)
Out[61]:
key value
1 1.0 1
2 2.0 2
3 NaN 4
In [58]: test.drop('key',axis=1)
Out[58]:
value
0 2
1 1
2 2
3 4
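The drop-style methods can be compared side by side in one short script. This is a sketch on the same data as above; the variable names are illustrative. Each call returns a new object and leaves the original frame unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": [1, 1, 2, np.nan], "value": [2, 1, 2, 4]})

no_first_row = df.drop(0)                # drop a row by its index label
no_key = df.drop(columns="key")          # same effect as df.drop('key', axis=1)
complete = df.dropna()                   # drop rows that contain any NaN
unique_keys = df.drop_duplicates("key")  # keep the first row for each key

print(no_first_row)
print(no_key)
print(complete)
print(unique_keys)
```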
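Part of pandas' function-style processing is provided by top-level arithmetic methods.
These methods (add, sub, div, mul) correspond to the operators +, -, / and *. They are simple enough to try out on your own.
apply operates on whole rows or columns, while applymap operates element by element. (In pandas 2.1 and later, DataFrame.map is the preferred name for applymap.)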
In [69]: test = DataFrame(np.arange(16).reshape((4,4)),columns=list("abcd"),index=list("kbgh"))
In [70]: test
Out[70]:
a b c d
k 0 1 2 3
b 4 5 6 7
g 8 9 10 11
h 12 13 14 15
In [71]: f = lambda x: x.max()-x.min()
In [72]: test.apply(f,axis=1)
Out[72]:
k 3
b 3
g 3
h 3
dtype: int64
In [73]: test.apply(f)
Out[73]:
a 12
b 12
c 12
d 12
dtype: int64
In [74]: def f(x):
    ...:     return Series([x.min(),x.max()],index=["min","max"])
    ...:
In [75]: test.apply(f)
Out[75]:
a b c d
min 0 1 2 3
max 12 13 14 15
In [76]: test.apply(f,axis=1)
Out[76]:
min max
k 0 3
b 4 7
g 8 11
h 12 15
In [79]: def f(x):
    ...:     return '%2d.0' % x
In [80]: test.applymap(f)
Out[80]:
a b c d
k 0.0 1.0 2.0 3.0
b 4.0 5.0 6.0 7.0
g 8.0 9.0 10.0 11.0
h 12.0 13.0 14.0 15.0
# The Series counterpart is map, e.g. test['a'].map(f)
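The three levels (per column, per row, per element) can be sketched in one script. The helper names `spread` and `fmt` are illustrative. For the element-wise step this sketch applies `Series.map` to each column, which is one way to get the same effect without depending on whether the installed pandas spells the DataFrame method `applymap` or `map`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=list("abc"), index=["k", "b"])

def spread(x):
    """Range of a row or column: max minus min."""
    return x.max() - x.min()

def fmt(v):
    """Format one scalar with a single decimal place."""
    return "%.1f" % v

per_column = df.apply(spread)            # f receives each column as a Series
per_row = df.apply(spread, axis=1)       # f receives each row as a Series
col_a = df["a"].map(fmt)                 # element-wise on one Series
whole = df.apply(lambda col: col.map(fmt))  # element-wise on every column

print(per_column)
print(per_row)
print(whole)
```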
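sort_values sorts by values. Call it directly on a Series, or on a DataFrame with by naming the column(s) (or, with axis=1, the row labels) to sort on. sort_index sorts by index labels. Both return a new, sorted object.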
In [99]: test.sort_values(by=['a','d'])
Out[99]:
a b c d
k 0 1 2 3
b 4 5 6 7
g 8 9 10 11
h 12 13 14 15
In [100]: test['a'].sort_values()
Out[100]:
k 0
b 4
g 8
h 12
Name: a, dtype: int64
In [101]: test.sort_index()
Out[101]:
a b c d
b 4 5 6 7
g 8 9 10 11
h 12 13 14 15
k 0 1 2 3
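Because the frame above is already ordered by column a, sort_values appears to do nothing there. The sketch below uses deliberately shuffled, made-up data so that the effect of a multi-column sort, a descending sort, and sort_index is visible.

```python
import pandas as pd

# Shuffled illustrative data with a tie in column 'a' (rows b and h).
df = pd.DataFrame({"a": [8, 0, 4, 4], "d": [1, 3, 7, 2]}, index=list("gkbh"))

by_a_then_d = df.sort_values(by=["a", "d"])        # ties in 'a' are broken by 'd'
descending = df.sort_values(by="a", ascending=False)
by_label = df.sort_index()                         # order by row label

print(by_a_then_d.index.tolist())  # ['k', 'h', 'b', 'g']
print(descending)
print(by_label.index.tolist())     # ['b', 'g', 'h', 'k']
```

The original `df` keeps its order, since each method returns a new object.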
numpy.argsort returns the array of indices that would sort a NumPy array in ascending order.
In [106]: test = np.array([1,4,2,-2])
In [107]: test
Out[107]: array([ 1, 4, 2, -2])
# test sorted in ascending order is [-2, 1, 2, 4]: -2 is at index 3 in test, 1 is at index 0, 2 is at index 2, and 4 is at index 1.
In [108]: test.argsort()
Out[108]: array([3, 0, 2, 1])
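The index array returned by argsort is mostly useful for reordering, either the array itself or another array of the same length. A minimal sketch, with illustrative variable names:

```python
import numpy as np

arr = np.array([1, 4, 2, -2])
labels = np.array(["w", "x", "y", "z"])  # a parallel array (illustrative)

order = arr.argsort()             # indices that sort arr in ascending order
sorted_arr = arr[order]           # indexing with order applies the sort
largest_first = arr[order[::-1]]  # reverse the order for a descending sort
sorted_labels = labels[order]     # reorder the parallel array the same way

print(order)          # [3 0 2 1]
print(sorted_arr)     # [-2  1  2  4]
print(sorted_labels)  # ['z' 'w' 'y' 'x']
```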
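Common summary functions include sum, mean, idxmax, idxmin, max, min, var, std and count. By default they skip NaN values. Try them out yourself; for anything more specialized, a quick web search will help.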
In [11]: test = DataFrame([[10.2,np.nan],[8.2,-10.1],[np.nan,np.nan],[10,2]],index=list('abcd'),columns=['br','kw'])
In [12]: test
Out[12]:
br kw
a 10.2 NaN
b 8.2 -10.1
c NaN NaN
d 10.0 2.0
In [13]: test.sum()
Out[13]:
br 28.4
kw -8.1
dtype: float64
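In [14]: test.sum(axis=1)  # row c is all NaN: NaN in the older pandas shown here; 0.0 since pandas 0.22 unless min_count=1 is passed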
Out[14]:
a 10.2
b -1.9
c NaN
d 12.0
dtype: float64
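A few of the other summary functions, applied to the same frame, are sketched below. The variable names are illustrative. The `skipna` and `min_count` arguments control how NaN is treated: `skipna=False` lets NaN propagate, and `min_count=1` makes the sum of an all-NaN row NaN instead of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[10.2, np.nan], [8.2, -10.1], [np.nan, np.nan], [10, 2]],
    index=list("abcd"),
    columns=["br", "kw"],
)

col_mean = df.mean()                   # NaN values are skipped by default
strict_sum = df.sum(skipna=False)      # any NaN makes the column sum NaN
top_label = df.idxmax()                # row label of each column's maximum
counts = df.count()                    # number of non-NaN values per column
row_sum = df.sum(axis=1, min_count=1)  # all-NaN row 'c' stays NaN

print(col_mean)
print(top_label)
print(counts)
print(row_sum)
```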
pandas offers many more operations and functions; consult the official documentation as the need arises.