Regular Expressions

小闫同学啊

Published 2019-07-18 11:33:33


Included in the column: 小闫笔记

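1. Regular expressions

1.1 Introduction to regular expressions

What it is: an expression that describes a rule for filtering data, written as a "rule string" (a pattern).

Where it is used: web crawlers, web development, and so on.

What it does: a pattern string expresses the logic for matching, filtering, and extracting string data.

Characteristics: powerful and general; supported by many programming languages.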

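1.2 Basic usage

Regular Expression -> re (the name of Python's module)

match_object = re.match(pattern, data)

Matching starts at the beginning of the data. If any part of the pattern fails to match, the whole match fails.
On success, re.match returns a match object.
On failure, it returns None.

Getting the matched text

match_object.group()

Strong recommendation: always write patterns with the r prefix (raw string). Without it, backslashes may be misinterpreted, for example in group references such as \1; with it, they never are.

A pattern is just a string; the r prefix also marks it visually as a regular expression.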

In [2]: import re

In [3]: re.match(r"python","python2")
Out[3]: <_sre.SRE_Match object; span=(0, 6), match='python'>

In [4]: re.match(r"python","usr python2")

In [5]: re.match(r"python","1python2")

In [6]: re.match(r"python","python2")
Out[6]: <_sre.SRE_Match object; span=(0, 6), match='python'>

In [7]: res = re.match(r"python","python2")

In [8]: res.group()
Out[8]: 'python'

In [9]: re.match(r"python","python2").group()
Out[9]: 'python'

The first argument is the regular expression. In [4] and In [5] print nothing because re.match returned None.
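Because re.match returns None on failure, calling .group() directly on the result can raise AttributeError. A minimal sketch of a safer calling pattern follows; the helper name match_prefix is hypothetical and not part of the re module.

```python
import re


def match_prefix(pattern, data):
    """Return the text matched at the start of data, or None if there is no match."""
    result = re.match(pattern, data)
    # re.match returns None on failure, so test the result before calling .group()
    if result is None:
        return None
    return result.group()


print(match_prefix(r"python", "python2"))      # python
print(match_prefix(r"python", "usr python2"))  # None
```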

1.2.1 The dot character

. matches any single character

It cannot match \n; every other single character matches.
To match a literal dot instead of "any character", escape it with a backslash: \.

In [11]: re.match(r"python3","python3").group()
Out[11]: 'python3'

In [12]: re.match(r"python.","python3").group()
Out[12]: 'python3'

In [13]: re.match(r"python.","python2").group()
Out[13]: 'python2'

In [14]: re.match(r"python.","pythonx").group()
Out[14]: 'pythonx'

In [15]: re.match(r"python.","python\n").group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-51506b890ce1> in <module>
----> 1 re.match(r"python.","python\n").group()

AttributeError: 'NoneType' object has no attribute 'group'
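The difference between an unescaped and an escaped dot can be sketched as follows; the sample strings are made up for illustration.

```python
import re

# An unescaped dot matches any single character except "\n"
loose = re.match(r"3.14", "3x14")
# An escaped dot matches only a literal "."
strict_miss = re.match(r"3\.14", "3x14")
strict_hit = re.match(r"3\.14", "3.14")

print(loose.group())       # 3x14
print(strict_miss)         # None
print(strict_hit.group())  # 3.14
```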

1.2.2 []

[characters] matches any one of the listed characters

[1234]

In [16]: re.match(r"python[123]","python3").group()
Out[16]: 'python3'

In [17]: re.match(r"python[1-9]","python3").group()
Out[17]: 'python3'

[-] matches any one character within a range

[A-Z] any one uppercase letter
[a-z] any one lowercase letter
[0-9] any one digit

[^] negates the set: it matches any one character that is NOT in the set

^ is read "caret"

In [18]: re.match(r"python[^1-9]","python3").group() # 3 is inside 1-9, so after negation it cannot match
-------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-9004647059d3> in <module>
----> 1 re.match(r"python[^1-9]","python3").group()

AttributeError: 'NoneType' object has no attribute 'group'
----------------------------------------------------------------
In [19]: re.match(r"python[^1-9]","pythonx").group() # x is not in 1-9, so it matches
Out[19]: 'pythonx'
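Character sets can be placed one after another, each consuming exactly one character. The following is a small sketch with made-up sample data.

```python
import re

# One uppercase letter, then one lowercase letter, then one digit
pattern = r"[A-Z][a-z][0-9]"
ok = re.match(pattern, "Py3")
bad = re.match(pattern, "py3")   # fails: "p" is not in [A-Z]

# [^0-9] matches any single character that is not a digit
not_digit = re.match(r"python[^0-9]", "python_")

print(ok.group())         # Py3
print(bad)                # None
print(not_digit.group())  # python_
```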

1.2.3 \d digit character, \D non-digit character

\ is read "backslash" (often just "slash" in casual speech)

d stands for digit

In [21]: re.match(r"python\d","python1").group()
Out[21]: 'python1'

In [22]: re.match(r"python\d","pythonx").group()
-------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-22-777a9faf30b4> in <module>
----> 1 re.match(r"python\d","pythonx").group()

AttributeError: 'NoneType' object has no attribute 'group'
-------------------------------------------------------------------
In [23]: re.match(r"python\D","pythonx").group()
Out[23]: 'pythonx'

In [24]: re.match(r"python\D","python1").group()
-------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-fe3167699886> in <module>
----> 1 re.match(r"python\D","python1").group()

AttributeError: 'NoneType' object has no attribute 'group'

\w word character (digits, letters, underscore). In ASCII mode (re.A) this is exactly [0-9a-zA-Z_]; by default Python 3 also accepts Unicode letters and digits.

\W (uppercase) non-word character. In ASCII mode this is [^0-9a-zA-Z_].

w stands for word

In [25]: re.match(r"python\w","python1").group()
Out[25]: 'python1'

In [26]: re.match(r"python\w","pythona").group()
Out[26]: 'pythona'

In [27]: re.match(r"python\w","pythonA").group()
Out[27]: 'pythonA'

In [28]: re.match(r"python\w","python_").group()
Out[28]: 'python_'

In [29]: re.match(r"python\w","python$").group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-29-8dd832fc8383> in <module>
----> 1 re.match(r"python\w","python$").group()

AttributeError: 'NoneType' object has no attribute 'group'

In [30]: re.match(r"python\W","python$").group()
Out[30]: 'python$'

\s whitespace character: [ \t\n\r\f\v]

s stands for space. The space is one kind of whitespace character, but whitespace is more than just the space.

\S non-whitespace character: [^ \t\n\r\f\v]

In [31]: re.match(r"python\s","python\r").group()
Out[31]: 'python\r'

In [32]: re.match(r"python\s","python\t").group()
Out[32]: 'python\t'

In [33]: re.match(r"python\s","python\f").group()
Out[33]: 'python\x0c'

In [34]: re.match(r"python\s","python\v").group()
Out[34]: 'python\x0b'

In [35]: re.match(r"python\s","python\n").group()
Out[35]: 'python\n'

In [36]: re.match(r"python\s","python1").group()
-------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-36-f09acbc6feed> in <module>
----> 1 re.match(r"python\s","python1").group()

AttributeError: 'NoneType' object has no attribute 'group'
--------------------------------------------
In [37]: re.match(r"python\S","python1").group()
Out[37]: 'python1'
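As a rough illustration of how \d, \w, and \s relate, the sketch below labels single characters. The helper classify is hypothetical; it tests \d before \w because every digit is also a word character.

```python
import re


def classify(ch):
    """Label one character by the first shorthand class it matches."""
    if re.match(r"\d", ch):
        return "digit"
    if re.match(r"\w", ch):
        return "word"
    if re.match(r"\s", ch):
        return "whitespace"
    return "other"


labels = [classify(c) for c in "a1_ \t$"]
print(labels)
# ['word', 'digit', 'word', 'whitespace', 'whitespace', 'other']
```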

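1.2.4 Extras

By default the dot . cannot match \n. Pass the re.S flag (re.DOTALL) to make . match any character at all.

re.match(r"python.org","python\norg",re.S).group()

In Python 3, \w also matches Chinese characters, because str patterns use Unicode matching by default (re.U).

re.A (ASCII): to make \w match only digits, letters, and the underscore, pass re.A as the flags argument, in the same position as re.S above.

In Python 2, \w matches only ASCII digits, letters, and the underscore by default; the re.A flag itself exists only in Python 3.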
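The effect of the two flags can be checked side by side. This is a small sketch for Python 3; the variable names are illustrative.

```python
import re

# re.S (re.DOTALL) lets "." match "\n" as well
without_s = re.match(r"python.org", "python\norg")
with_s = re.match(r"python.org", "python\norg", re.S)

# In Python 3, \w matches Unicode word characters by default;
# re.A restricts it to ASCII letters, digits, and the underscore
unicode_w = re.match(r"\w", "嫦")
ascii_w = re.match(r"\w", "嫦", re.A)

print(without_s, with_s is not None)   # None True
print(unicode_w.group(), ascii_w)      # 嫦 None
```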

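1.3 Quantifiers: how many times to match

A quantifier says how many times the preceding item must occur.

Note: no spaces are allowed inside the braces.

{m,n} at least m times, at most n times

{m} exactly m times

{m,} at least m times, no upper limit

{,n} 0 to n times

+ at least once

* any number of times, including 0

? 0 times or once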

In [43]: re.match(r"嫦娥\d号","嫦娥1号").group() # exactly one digit
Out[43]: '嫦娥1号'

In [44]: re.match(r"嫦娥\d{1,10}号","嫦娥1号").group() # 1 to 10 times
Out[44]: '嫦娥1号'

In [51]: re.match(r"嫦娥\d{1,}号","嫦娥10000号").group() # 1 or more, no upper limit
Out[51]: '嫦娥10000号'

In [52]: re.match(r"嫦娥\d{,2}号","嫦娥10000号").group() # no lower limit, so 0 to 2 times; five digits do not fit, so re.match returns None and .group() raises AttributeError

In [54]: re.match(r"嫦娥\d{2,2}号","嫦娥10号").group()
Out[54]: '嫦娥10号'

In [55]: re.match(r"嫦娥\d{2}号","嫦娥10号").group() # exactly 2 times
Out[55]: '嫦娥10号'

In [56]: re.match(r"嫦娥\d{20,10}号","嫦娥10号").group() # invalid pattern: the minimum is greater than the maximum, so re raises an error

In [57]: re.match(r"嫦娥\d+号","嫦娥10000号").group() # at least once
Out[57]: '嫦娥10000号'

In [58]: re.match(r"嫦娥\d*号","嫦娥10000号").group() # any number of times, including 0
Out[58]: '嫦娥10000号'

In [59]: re.match(r"嫦娥\d{0,1}号","嫦娥1号").group()
Out[59]: '嫦娥1号'

In [60]: re.match(r"嫦娥\d?号","嫦娥1号").group() # 0 times or once
Out[60]: '嫦娥1号'
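Two edge cases of the brace quantifiers are worth trying by hand: an omitted lower bound, and a minimum that exceeds the maximum. The sketch below uses illustrative variable names.

```python
import re

# {,2} means 0 to 2 digits, so five digits before "号" cannot match
too_many = re.match(r"嫦娥\d{,2}号", "嫦娥10000号")
# Zero digits is allowed by {,2}
zero_ok = re.match(r"嫦娥\d{,2}号", "嫦娥号")

# A minimum larger than the maximum is rejected when the pattern is compiled
try:
    re.compile(r"\d{20,10}")
    compiled = True
except re.error:
    compiled = False

print(too_many)          # None
print(zero_ok.group())   # 嫦娥号
print(compiled)          # False
```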

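1.4 Matching positions

^ matches the start position. Do not confuse it with [^], which negates a character set.

$ matches the end position

r"^pattern$"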

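import re

def main():
    emails = []
    while True:
        s = input("Enter an email address to check: ")
        emails.append(s)
        for email in emails:
            # ^ matches the start position, $ matches the end position
            # re.match already starts at the beginning, so ^ is optional;
            # it is written here for clarity, and omitting it is not an error
            ret = re.match(r"^\w{4,16}@qq\.com$", email, re.A)
            if ret:
                print(ret.group(), "is a QQ email address")
            else:
                print(email, "is not a QQ email address")

if __name__ == '__main__':
    main()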
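For checking the anchors without typing input interactively, the same rule can be wrapped in a function. This is only a sketch; the names QQ_EMAIL and is_qq_email are illustrative.

```python
import re

# Same rule as the interactive example: 4 to 16 ASCII word characters, then @qq.com
QQ_EMAIL = re.compile(r"^\w{4,16}@qq\.com$", re.A)


def is_qq_email(address):
    """Return True when the whole string fits the QQ mailbox rule."""
    return QQ_EMAIL.match(address) is not None


print(is_qq_email("123456@qq.com"))     # True
print(is_qq_email("123456@qq.com.cn"))  # False: "$" requires the string to end after ".com"
print(is_qq_email("abc@qq.com"))        # False: fewer than 4 characters before "@"
```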

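1.5 Anonymous groups (group)

Purpose: extract the parts you care about from a larger piece of data.

In [74]: re.match(r"嫦娥(\d+)号","嫦娥998号").group(1)
Out[74]: '998'

Creating a group: "(pattern)". Wrap the part you are interested in with parentheses.

User-created groups are numbered from 1; group 0 always holds the whole match.

Getting group results:

match_object.group(n), where n defaults to 0
group(number, number) returns a tuple

In [81]: re.match(r"(\d{3,4})-(\d{6,8}) \1-\2","010-000001 010-000001").group(1,2)
Out[81]: ('010', '000001')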

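Backreferences:

Use the text matched by an earlier group at a later position in the same pattern.
\group_number

# Group the area code and the landline number; take group 1, the area code
In [78]: re.match(r"(\d{3,4})-(\d{6,8})","010-000001").group(1)
Out[78]: '010'

# Same grouping; take group 2, the landline number
In [79]: re.match(r"(\d{3,4})-(\d{6,8})","010-000001").group(2)
Out[79]: '000001'

# To require that the second number equals the first, write \1 to reuse what group 1 matched. The data here has two groups, so the pattern refers to both with \1-\2 instead of repeating the sub-patterns.
In [80]: re.match(r"(\d{3,4})-(\d{6,8}) \1-\2","010-000001 010-000001").group(2)
Out[80]: '000001'

In [75]: re.match(r"嫦娥(\d+)号 \1","嫦娥998号998").group(1)
# The pattern expects a space before \1 but this data has none, so re.match returns None and .group(1) raises AttributeError. With the data "嫦娥998号 998" the result is '998'.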
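A backreference only matches the exact text its group captured, not just anything the sub-pattern could match. A short sketch with made-up data:

```python
import re

# (\w+) captures a word; \1 requires the same text again after one space
repeated = re.match(r"(\w+) \1", "hello hello world")
different = re.match(r"(\w+) \1", "hello world")

print(repeated.group(0))   # hello hello
print(repeated.group(1))   # hello
print(repeated.groups())   # ('hello',)
print(different)           # None
```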

1.5.1 Extras

The groups created so far have no names, only indexes.
They work like a list, whereas a dict has names (keys).

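1.6 Named groups

When to use: by default groups have no names and can only be accessed by number. If the numbering changes, every place that uses the numbers has to change as well. Giving each group a name means a change in numbering no longer affects how the pattern is used.

Creating:

"(?P<group_name>pattern)"

Getting the result:

.group("group_name")
Access by index still works, but it defeats the purpose of naming.

Backreference:

"(?P<group_name>pattern)(?P=group_name)"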

In [1]: import re

In [2]: re.match(r"(?P<area>\d{3,4})-(?P<number>\d{6,8}) (?P=area)-(?P=number)","010-000001 010-000001").group("area")
Out[2]: '010'

In [3]: re.match(r"((?P<area>\d{3,4})-(?P<number>\d{6,8})) (?P=area)-(?P=number)","010-000001 010-000001").group("area")
Out[3]: '010'

In [4]: re.match(r"((?P<area>\d{3,4})-(?P<number>\d{6,8})) (?P=area)-(?P=number)","010-000001 010-000001").group("number")
Out[4]: '000001'

In [5]: re.match(r"(?P<area>\d{3,4})-(?P<number>\d{6,8}) (?P=area)-(?P=number)","010-000001 010-000001").group(1)
Out[5]: '010'

# With an extra pair of outer parentheses, the outer group becomes group 1, so the number changes while the names stay valid
In [6]: re.match(r"((?P<area>\d{3,4})-(?P<number>\d{6,8})) (?P=area)-(?P=number)","010-000001 010-000001").group(1)
Out[6]: '010-000001'
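Named groups can also be read all at once as a dict with groupdict(), which fits the list-versus-dict comparison made earlier. A minimal sketch, where PHONE and the group names area and number are chosen for illustration:

```python
import re

PHONE = re.compile(r"(?P<area>\d{3,4})-(?P<number>\d{6,8})")

result = PHONE.match("010-000001")
print(result.group("area"))   # 010
print(result.groupdict())     # {'area': '010', 'number': '000001'}
```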

1.7 Other uses of groups

r"expr1|expr2|expr3" matches any one of the expressions separated by |

r"expr(part1|part2|part3)" matches any one of the alternatives inside the group

In [12]: re.match(r"^\w{4,16}@163\.com$|^\w{4,16}@263\.com$","123456@263.com").group()
Out[12]: '123456@263.com'
# The alternatives above are whole patterns. This works, but it is verbose.
-------------------------------------------------------------------
In [13]: re.match(r"^\w{4,16}@(163|263|qq)\.com$","123456@263.com").group()
Out[13]: '123456@263.com'

In [14]: re.match(r"^\w{4,16}@(163|263|qq)\.com$","123456@qq.com").group()
Out[14]: '123456@qq.com'

In [15]: re.match(r"^\w{4,16}@(163|263|qq)\.com$","123456@163.com").group()
Out[15]: '123456@163.com'
------------------------------------------------------
# Here only the part that differs is placed in the alternation, which makes the pattern much shorter and easier to write.
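Since the alternation sits inside a capturing group, group(1) also tells which alternative matched. A small sketch; the helper name provider is hypothetical.

```python
import re

EMAIL = re.compile(r"^\w{4,16}@(163|263|qq)\.com$")


def provider(address):
    """Return the provider part ("163", "263" or "qq"), or None for other addresses."""
    result = EMAIL.match(address)
    return result.group(1) if result else None


print(provider("123456@qq.com"))     # qq
print(provider("123456@263.com"))    # 263
print(provider("123456@gmail.com"))  # None
```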

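1.8 More functions of the re module

findall: find all matches

sub: substitute

split: split

search: find the first match only

search(pattern, data) -> match object on success, None on failure

1) Scans forward from the start of the data and tries to match at each position; if a position fails it moves on, until a match is found or the data is exhausted.

In [20]: re.search(r"\d+","python=100 cpp=96").group()
Out[20]: '100'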
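The contrast with re.match is easiest to see on the same data. A short sketch:

```python
import re

data = "python=100 cpp=96"

at_start = re.match(r"\d+", data)    # None: the data does not start with a digit
anywhere = re.search(r"\d+", data)   # finds the first run of digits

print(at_start)          # None
print(anywhere.group())  # 100
print(anywhere.span())   # (7, 10)
```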

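findall(pattern, data) -> list of all matches

1) Finds every non-overlapping part of the data that matches the pattern and returns them as a list.

2) If the pattern contains one group, the list holds that group's text instead of the whole match; with several groups, each item is a tuple of the groups.

3) To get whole matches from a pattern that needs parentheses, make the groups non-capturing with (?:...).

In [21]: re.findall(r"\d+","python=100 cpp=96")
Out[21]: ['100', '96']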
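How groups change the shape of the findall result can be sketched as follows; the variable names are illustrative.

```python
import re

data = "python=100 cpp=96"

whole = re.findall(r"\w+=\d+", data)              # no groups: whole matches
one_group = re.findall(r"\w+=(\d+)", data)        # one group: that group's text
two_groups = re.findall(r"(\w+)=(\d+)", data)     # several groups: tuples
non_capturing = re.findall(r"(?:\w+)=\d+", data)  # (?:...) does not count as a group

print(whole)          # ['python=100', 'cpp=96']
print(one_group)      # ['100', '96']
print(two_groups)     # [('python', '100'), ('cpp', '96')]
print(non_capturing)  # ['python=100', 'cpp=96']
```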

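sub(arg1: pattern, arg2: replacement, arg3: data, arg4: count) -> the data after replacement

1) Finds the parts of arg3 that match arg1 and replaces them with arg2, at most arg4 times.

2) By default (count=0) every match is replaced.

In [26]: re.sub(r"\d{2,3}","99","python=100 cpp=96")
Out[26]: 'python=99 cpp=99'

If arg2 is "", the matching parts are simply deleted.

In [34]: re.sub(r"\d{2,3}","","python=100 cpp=96")
Out[34]: 'python= cpp='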

For reference: arg2 can also be a function.

def function_name(match_object):
    get the data from the match object
    process the data
    return the processed result as a string

In [27]: def addone(matchobj):
   ...:     """Called by sub for each match; receives the match object and returns the replacement text."""
   ...:     number = int(matchobj.group())
   ...:     number += 1
   ...:     return str(number)

In [28]: re.sub(r"\d{2,3}",addone,"python=100 cpp=96")
Out[28]: 'python=101 cpp=97'
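The count argument and a replacement function can be tried together. In this sketch the callback name double is made up for illustration.

```python
import re

data = "python=100 cpp=96"

# count=1 replaces only the first match
first_only = re.sub(r"\d{2,3}", "99", data, count=1)


def double(matchobj):
    """Replacement callback: return the matched number multiplied by two."""
    return str(int(matchobj.group()) * 2)


doubled = re.sub(r"\d{2,3}", double, data)

print(first_only)  # python=99 cpp=96
print(doubled)     # python=200 cpp=192
```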

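split(pattern, data) -> list of the pieces after splitting

In [30]: data = "貂蝉:杨玉环，西施:凤姐"
# The comma in this string is the full-width (Chinese) comma ，

In [31]: re.split(r":",data)
Out[31]: ['貂蝉', '杨玉环，西施', '凤姐']

In [32]: re.split(r":|,",data)
Out[32]: ['貂蝉', '杨玉环，西施', '凤姐']
# This splits on : or , but the comma in the pattern is the half-width (English) one, so the full-width comma in the data is not split.

In [33]: re.split(r":|，",data)
Out[33]: ['貂蝉', '杨玉环', '西施', '凤姐']
# With the pattern's comma changed to the full-width one, matching the data, every piece is split.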
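When every delimiter is a single character, a character set is a compact alternative to |, and maxsplit limits the number of cuts. A brief sketch:

```python
import re

data = "貂蝉:杨玉环，西施:凤姐"

# A character set splits on any one of the listed delimiters
parts = re.split(r"[:，]", data)
# maxsplit=1 cuts once and leaves the rest of the string intact
head = re.split(r"[:，]", data, maxsplit=1)

print(parts)  # ['貂蝉', '杨玉环', '西施', '凤姐']
print(head)   # ['貂蝉', '杨玉环，西施:凤姐']
```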

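1.9 Greedy and non-greedy (lazy) matching

Greedy mode is the default: a quantifier matches as much as possible.

Lazy mode: a quantifier matches as little as possible.

To switch a quantifier from greedy to lazy, add ? after it.

Precondition: in both modes the pattern as a whole must still match.

re.search(r"https://.+\.jpg",url).group()   # greedy: extends to the last ".jpg" it can reach
re.search(r"https://.+?\.jpg",url).group()  # lazy: stops at the first ".jpg"
# Once greedy and lazy modes are clear, look at the following code

re.findall(r"https://.+?\.jpg|https://.+?\.png",url)
re.findall(r"https://.+?\.(?:jpg|png)",url)  # (?:...) is non-capturing, so findall still returns whole URLs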
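\a and \b already have a meaning in ordinary (non-raw) strings, the bell and backspace characters, and each counts as one character.

1.10 What the r prefix does

To match one literal backslash in the data, the regex itself needs two backslashes (\\), and writing that in an ordinary string literal takes four ("\\\\").

The r prefix removes this backslash trouble.

It is equivalent to escaping every \ in the string automatically, giving a double backslash \\.

r"\1" ===> "\\1"  # every \ in the string is escaped automatically -> double backslash \\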
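The backslash counting can be verified with a short experiment. This is a sketch; the sample path is made up.

```python
import re

path = "C:\\temp"   # the data contains one real backslash (7 characters in total)

# Without r, the regex needs \\ and the string literal doubles that again
plain = re.search("\\\\", path)
# With r, the pattern is written exactly as the regex engine should see it
raw = re.search(r"\\", path)

# "\b" in an ordinary string is one backspace character; r"\b" is two characters
plain_len = len("\b")
raw_len = len(r"\b")

print(len(path), plain.group() == raw.group())  # 7 True
print(plain_len, raw_len)                       # 1 2
```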


Originally published: 2019-01-18


This article is shared from the WeChat official account 全栈技术精选.
