Data Analysis: pandas Basics (Part 3)

andrew_a

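Following on from the previous article, this post looks at how to process text data with pandas and how to select data using indexes: loc, iloc, ix, direct [] selection, and attribute access.

I. Processing text data

Here we perform string operations on a basic Series or Index through the .str accessor.

First, an overview of the functions we will use.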

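No.	Function	Description
1	lower()	Converts the strings in a Series/Index to lowercase.
2	upper()	Converts the strings in a Series/Index to uppercase.
3	len()	Computes the length of each string.
4	strip()	Removes whitespace (including newlines) from both sides of each string in a Series/Index.
5	split(' ')	Splits each string on the given separator.
6	cat(sep=' ')	Concatenates the Series/Index elements using the given separator.
7	get_dummies()	Returns a DataFrame of one-hot encoded values.
8	contains(pattern)	Returns True for each element that contains the pattern, otherwise False.
9	replace(a, b)	Replaces a with b.
10	repeat(value)	Repeats each element the given number of times.
11	count(pattern)	Returns the number of times the pattern occurs in each element.
12	startswith(pattern)	Returns True for each element in the Series/Index that starts with the pattern.
13	endswith(pattern)	Returns True for each element in the Series/Index that ends with the pattern.
14	find(pattern)	Returns the position of the first occurrence of the substring in each element (-1 if it is absent).
15	findall(pattern)	Returns a list of all occurrences of the pattern in each element.
16	swapcase()	Converts uppercase characters to lowercase and lowercase characters to uppercase.
17	islower()	Checks whether all characters in each string in the Series/Index are lowercase. Returns booleans.
18	isupper()	Checks whether all characters in each string in the Series/Index are uppercase. Returns booleans.
19	isnumeric()	Checks whether all characters in each string in the Series/Index are numeric. Returns booleans.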

Let's look at some concrete examples:

1) lower()

Converts all characters in each string to lowercase.

import numpy as np
import pandas as pd
# Processing text data
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])
print(s)
print(s.str.lower()) # convert the strings in the Series/Index to lowercase
"""
Output:
s:
0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveSmith
dtype: object

lower:
0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      stevesmith
dtype: object
"""

2) upper()

Converts all characters in each string to uppercase.

print(s.str.upper()) # convert the strings in the Series/Index to uppercase
"""
Output:
0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVESMITH
dtype: object
"""

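3) len()

Computes the length of each string. Because the Series contains a NaN, the lengths come back as float64.

print(s.str.len())
"""
Output: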
0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64
"""

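The float64 result above is a side effect of the missing value: NaN cannot be stored in a plain integer column, so the lengths are upcast to floats. A minimal sketch, using the same data, of one way to get integer lengths by dropping the missing values first (the variable names are illustrative, not from the article):

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])

# Lengths for every row; the NaN row stays missing
lengths_all = s.str.len()

# Drop missing values first, then cast so the lengths are plain integers
lengths_clean = s.dropna().str.len().astype(int)

print(lengths_all)
print(lengths_clean)
```

Dropping rows changes the index, so this only fits cases where the missing entries are not needed later.
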
4) strip()

Removes whitespace from the beginning and end of each string.

s1 = pd.Series(['Tom  ', '  William Rick     ', 'John', 'Alber@t '])
print(s1)
print("after stripping")
print(s1.str.strip()) # remove whitespace (including newlines) from both sides of each string
"""
Output:
0                  Tom  
1      William Rick     
2                   John
3               Alber@t 
dtype: object

after stripping
0             Tom
1    William Rick
2            John
3         Alber@t
dtype: object
"""

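split() appears in the function table but has no example of its own. A small sketch of how it might be used, assuming space-separated names; the sample data and variable names are made up for illustration, while `expand=True` and `.str[0]` are standard pandas options:

```python
import pandas as pd

names = pd.Series(['Tom Hanks', 'William Rick', 'John'])

# Split each string on a space; every element becomes a list of parts
parts = names.str.split(' ')

# expand=True spreads the parts across DataFrame columns,
# padding shorter rows with missing values
columns = names.str.split(' ', expand=True)

# .str[0] picks the first part of every list
first_names = parts.str[0]

print(parts)
print(columns)
print(first_names)
```
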
5) cat()

Joins the strings together using the specified separator.

s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t'])
print(s.str.cat(sep='_')) # join the Series/Index elements with the given separator
"""
Output:
Tom   _William Rick_John_Alber@t
"""

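6) get_dummies()

Converts the values to one-hot encoding, that is, 0/1 encoding. An earlier article covered one-hot encoding with numpy (Data Analysis: numpy Basics (Part 3)).

s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t'])
print(s.str.get_dummies()) # return a DataFrame of one-hot encoded values
"""
Output: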
   Alber@t  John  Tom     William Rick
0        0     0       1             0
1        0     0       0             1
2        0     1       0             0
3        1     0       0             0
"""

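Series.str.get_dummies() also splits each string on a separator (by default '|') before encoding, which helps when one cell holds several labels. A minimal sketch with made-up tag data, not taken from the article:

```python
import pandas as pd

tags = pd.Series(['python|pandas', 'python', 'numpy|pandas'])

# Each distinct label becomes a column; a 1 marks the labels present in that row
encoded = tags.str.get_dummies(sep='|')
print(encoded)
```
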
7) contains()

Returns True for each element that contains the given string, otherwise False.

# Check whether each string contains a space
print(s.str.contains(' ')) # True where the element contains the string, otherwise False
"""
Output:
0     True
1     True
2    False
3    False
dtype: bool
"""

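By default contains() treats its argument as a regular expression and returns a missing value for missing input. A hedged sketch of the `regex`, `case`, and `na` parameters, using the article's data plus one NaN added here for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t', np.nan])

# regex=False matches the text literally, so '.' is not a wildcard here
has_dot = s.str.contains('.', regex=False, na=False)

# case=False ignores letter case; na=False reports missing values as False
has_rick = s.str.contains('rick', case=False, na=False)

print(has_dot)
print(has_rick)
```

Passing `na=False` keeps the result purely boolean, so it can be used directly as a row filter.
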
8) replace()

Replaces substrings.

# Replace a with b (regex=False treats the pattern as literal text)
s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t'])
print(s,'\n')
print("after replacing @ with $:")
print(s.str.replace('@', '$', regex=False),'\n')
print(s.str.replace('m', '$', regex=False))
"""
Output:
0          Tom   
1    William Rick
2            John
3         Alber@t
dtype: object

after replacing @ with $:
0          Tom   
1    William Rick
2            John
3         Alber$t
dtype: object

0          To$   
1    Willia$ Rick
2            John
3         Alber@t
dtype: object
"""

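Whether the first argument of replace() is read as a regular expression depends on the pandas version (the default changed from regex=True to regex=False in pandas 2.0), so it is safer to pass `regex` explicitly. A minimal sketch showing both modes on the same data:

```python
import pandas as pd

s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t'])

# Literal replacement: '@' is replaced as plain text
literal = s.str.replace('@', '$', regex=False)

# Regular-expression replacement: remove trailing whitespace
trimmed = s.str.replace(r'\s+$', '', regex=True)

print(literal)
print(trimmed)
```
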
9) repeat()

Repeats each element the specified number of times.

# Repeat each element the given number of times
s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t'])
print(s.str.repeat(2))
"""
Output:
0                Tom   Tom   
1    William RickWilliam Rick
2                    JohnJohn
3              Alber@tAlber@t
dtype: object
"""

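10) count()

Counts how many times a pattern occurs in each element.

# Return the number of occurrences of the pattern in each element
s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t'])
print("the number of 'o's in each string:")
print(s.str.count('o')) # number of times 'o' occurs in each string
"""
Output: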
the number of 'o's in each string:
0    1
1    0
2    1
3    0
dtype: int64
"""

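11) startswith()

Returns True for each string that starts with the given characters.

# True for each element in the Series/Index that starts with the given characters
s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t'])
print("string that start with 'T':")
print(s.str.startswith('T')) # check whether each string starts with 'T'
"""
Output: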
string that start with 'T':
0     True
1    False
2    False
3    False
dtype: bool
"""

12) endswith()

Returns True for each element in the Series/Index that ends with the given characters.

# True for each element in the Series/Index that ends with the given characters
s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t'])
print("string that end with 't':")
print(s.str.endswith('t'))
"""
Output:
string that end with 't':
0    False
1    False
2    False
3     True
dtype: bool
"""

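13) find()

Returns the position of the first occurrence of the substring in each string, or -1 if it does not occur.

# Return the position where the substring first occurs
s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t'])
print(s.str.find('o'))
"""
Output: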
0    1
1   -1
2    1
3   -1
dtype: int64
"""

14) findall()

Returns a list of all occurrences of the pattern in each string.

# Return a list of all occurrences
s = pd.Series(['Tom   ', 'William Rick', 'John', 'Alber@t'])
print(s.str.findall('o'))
"""
Output:
0    [o]
1     []
2    [o]
3     []
dtype: object
"""

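15) islower()

Checks whether all characters in each string in the Series/Index are lowercase and returns a boolean for each.

# Check whether all characters in each string are lowercase; returns booleans
s = pd.Series(['tom', 'William Rick', 'John', 'Alber@t'])
print(s.str.islower())
"""
Output: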
0     True
1    False
2    False
3    False
dtype: bool
"""

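swapcase() is listed in the function table; a minimal sketch of what it does, with sample data chosen here for illustration:

```python
import pandas as pd

s = pd.Series(['Tom', 'William Rick', 'JOHN', 'Alber@t'])

# Uppercase letters become lowercase and vice versa; other characters are unchanged
swapped = s.str.swapcase()
print(swapped)
```
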
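16) isupper()

Checks whether all characters in each string in the Series/Index are uppercase and returns a boolean for each.

# Check whether all characters in each string are uppercase; returns booleans
s = pd.Series(['Tom', 'William Rick', 'JOHN', 'Alber@t'])
print(s.str.isupper())
"""
Output: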
0    False
1    False
2     True
3    False
dtype: bool
"""

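17) isnumeric()

Checks whether all characters in each string in the Series/Index are numeric and returns a boolean for each.

# Check whether all characters in each string are numeric; returns booleans
s = pd.Series(['1', 'William Rick', 'John', 'Alber@t'])
print(s.str.isnumeric())
"""
Output: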
0     True
1    False
2    False
3    False
dtype: bool
"""

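II. Indexing and selecting data with pandas

1) loc[]: selects rows by the actual label values in the index. Inside the brackets the row selector comes first and the column selector second, separated by a comma; both are labels (row labels and column labels).

# loc
import pandas as pd
import numpy as np
# pandas indexing
# loc uses a comma to separate the row selector from the column selector
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
print(df)
print(df.loc[:,['A','C']]) # rows: a through h; columns: A and C
print(df.loc[['a','b','f','h'],['A','D']]) # rows: a, b, f, h; columns: A, D
print(df.loc['a':'h']) # rows: a through h (label slices include both endpoints); columns: all
"""
Output: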
df
          A         B         C         D
a  0.606264  1.841561 -0.225627  0.985734
b  1.018913  1.848211 -0.061858 -0.865333
c -0.597659  0.136788  1.339162 -1.402188
d  0.060156 -0.739114 -0.922197  1.004415
e -1.254742  0.164954 -0.025894  0.097442
f -0.257760  0.863664  1.237688  1.599834
g -0.843632  0.202047 -0.175664 -0.525140
h -2.419964  0.264638  0.149577 -0.319869

print(df.loc[:,['A','C']]) # rows: a through h; columns: A and C
          A         C
a  0.606264 -0.225627
b  1.018913 -0.061858
c -0.597659  1.339162
d  0.060156 -0.922197
e -1.254742 -0.025894
f -0.257760  1.237688
g -0.843632 -0.175664
h -2.419964  0.149577

print(df.loc[['a','b','f','h'],['A','D']]) # rows: a, b, f, h; columns: A, D
          A         D
a  0.606264  0.985734
b  1.018913 -0.865333
f -0.257760  1.599834
h -2.419964 -0.319869

print(df.loc['a':'h']) # rows: a through h (label slices include both endpoints); columns: all
          A         B         C         D
a  0.606264  1.841561 -0.225627  0.985734
b  1.018913  1.848211 -0.061858 -0.865333
c -0.597659  0.136788  1.339162 -1.402188
d  0.060156 -0.739114 -0.922197  1.004415
e -1.254742  0.164954 -0.025894  0.097442
f -0.257760  0.863664  1.237688  1.599834
g -0.843632  0.202047 -0.175664 -0.525140
h -2.419964  0.264638  0.149577 -0.319869
"""

Boolean test:

print(df.loc['a']>0)
"""
Output:
A     True
B     True
C    False
D     True
Name: a, dtype: bool
"""

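A boolean result like the one above can itself be passed to loc to filter rows or columns. A minimal sketch, using small fixed values instead of random ones so the outcome is reproducible (the data and variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [0.6, -0.5, 1.0], 'B': [1.8, 0.1, -0.7], 'C': [-0.2, 1.3, 0.4]},
    index=['a', 'b', 'c'],
)

# Rows where column A is positive, restricted to columns A and C
positive_a = df.loc[df['A'] > 0, ['A', 'C']]

# Columns whose value in row 'a' is positive
positive_in_row_a = df.loc[:, df.loc['a'] > 0]

print(positive_a)
print(positive_in_row_a)
```
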
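2) iloc[]: selects rows by position. Inside the brackets the row selector again comes first and the column selector second, separated by a comma. Unlike loc, .iloc indexes by integer row and column positions, counting from 0.

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print(df)
print(df.iloc[:4]) # first four rows
print(df.iloc[2:5, 2:4]) # rows: 3rd through 5th; columns: 3rd through 4th
"""
Output: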
          A         B         C         D
0 -2.244208  0.289247 -0.610394  0.122074
1 -0.567245 -0.981979  0.745125  0.307382
2 -0.413786  0.057957  1.369953 -0.591533
3 -0.284472  0.208514 -0.754006 -1.990831
4  0.579098 -0.351095 -1.097065 -2.717286
5  1.391632  1.000434 -0.025586  0.713731
6  0.030316  0.407541  2.015870 -0.550394
7 -0.450774 -0.293389 -0.053082  0.098550

print(df.iloc[:4]) # first four rows
          A         B         C         D
0 -2.244208  0.289247 -0.610394  0.122074
1 -0.567245 -0.981979  0.745125  0.307382
2 -0.413786  0.057957  1.369953 -0.591533
3 -0.284472  0.208514 -0.754006 -1.990831
print(df.iloc[2:5, 2:4]) # rows: 3rd through 5th; columns: 3rd through 4th
          C         D
2  1.369953 -0.591533
3 -0.754006 -1.990831
4 -1.097065 -2.717286

"""

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

print(df.iloc[[1, 3, 5], [1, 3]])
print( df.iloc[1:3, :])
print( df.iloc[:,1:3])
"""
输出：
          B         D
1  1.079715  0.471654
3 -0.440755 -0.716878
5  1.161093  0.860139
          A         B         C         D
1 -0.062004  1.079715 -0.769709  0.471654
2 -1.617348  0.942890 -1.416927 -0.494119
          B         C
0 -0.854684 -0.461940
1  1.079715 -0.769709
2  0.942890 -1.416927
3 -0.440755  1.015276
4  0.848106  0.399829
5  1.161093 -0.447737
6  0.740808  0.756544
7  0.201873  0.117193
"""

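One difference between the two indexers is easy to miss: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. A small sketch with fixed, made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]},
                  index=['a', 'b', 'c', 'd'])

# Label slice: both 'a' and 'c' are included
by_label = df.loc['a':'c']

# Position slice: positions 0 and 1 only; position 2 is excluded
by_position = df.iloc[0:2]

print(by_label)
print(by_position)
```
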
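3) ix[]: combined the behavior of loc and iloc. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc or iloc instead.

import pandas as pd
import numpy as np

# Note: this example only runs on pandas versions earlier than 1.0
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print(df.ix[:4],'\n')
print(df.ix[:, 'A'])
"""
Output: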
         A         B         C         D
0 -0.823268  0.388715  0.015230 -1.526225
1  0.706624 -0.644479 -0.764379  0.949815
2  1.039163 -0.033233 -0.104077  0.617475
3  0.425551 -1.141799 -0.326049 -0.720935
4  0.872313  1.017448 -0.653088  0.128724 

0   -0.823268
1    0.706624
2    1.039163
3    0.425551
4    0.872313
5   -1.790740
6   -0.270651
7    0.167843
Name: A, dtype: float64
"""

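Because .ix no longer exists in current pandas, the example above fails on recent versions. A sketch of equivalent selections with loc and iloc; the fixed seed is an addition here, chosen only for reproducibility:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility
df = pd.DataFrame(rng.standard_normal((8, 4)), columns=['A', 'B', 'C', 'D'])

# Label-based: with the default integer index, labels 0 through 4 (inclusive)
rows_by_label = df.loc[:4]

# Position-based: the first four rows (the end position is excluded)
rows_by_position = df.iloc[:4]

# All rows of column 'A'
col_a = df.loc[:, 'A']

print(rows_by_label)
print(rows_by_position)
print(col_a)
```

With an integer index, `df.loc[:4]` returns five rows, which matches the ix output shown above, while `df.iloc[:4]` returns four.
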
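4) Direct selection with []

# Direct selection with []
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print(df['A'],'\n') # a single column
print(df[['A','B']],'\n') # several columns
print(df[2:2]) # row slice; 2:2 is empty, so no rows are returned
"""
Output: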

0   -1.458988
1   -0.770730
2   -0.263054
3   -0.244680
4    0.566692
5   -1.935684
6    1.971595
7   -1.229495
Name: A, dtype: float64 

          A         B
0 -1.458988 -0.561227
1 -0.770730  0.430620
2 -0.263054 -1.277515
3 -0.244680  0.241342
4  0.566692 -0.469877
5 -1.935684  1.192263
6  1.971595 -0.368445
7 -1.229495  0.871946 

Empty DataFrame
Columns: [A, B, C, D]
Index: []
"""

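With plain [], a string or a list of strings selects columns, while a slice selects rows by position; a slice whose start equals its stop, such as 2:2, returns no rows. A minimal sketch with small fixed data (illustrative values):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

single_column = df['A']        # a Series
two_columns = df[['A', 'B']]   # a DataFrame
rows_2_to_3 = df[2:4]          # rows at positions 2 and 3
no_rows = df[2:2]              # start equals stop, so the result is empty

print(rows_2_to_3)
print(no_rows)
```
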
5) Attribute access

# Attribute access
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

print(df.A)
"""
Output:
0    0.645722
1   -0.786466
2   -0.424352
3    1.838143
4    1.163596
5    1.436797
6   -0.756388
7    0.353392
Name: A, dtype: float64
"""

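Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. A short sketch of the cases where bracket access is needed instead; the column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'count': [10, 20], 'unit price': [1.5, 2.5]})

print(df.A)              # works: 'A' is a valid identifier with no name clash
print(df['count'])       # the column; df.count is the DataFrame.count method
print(df['unit price'])  # names containing spaces cannot be written as attributes

# Attribute and bracket access return the same data when both are possible
same = df.A.equals(df['A'])
```
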