An Introduction to Regular Expressions

iOSDevLog

Published 2019-04-18 16:38:04

Collected in the column: iOSDevLog

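What is a regular expression?

A regular expression (often abbreviated in code as regex, regexp, or RE) is a concept from computer science: a way of writing patterns that match strings. These patterns are typically used to search a string for specific things, or to search for something and then replace it. Regular expressions are great for string manipulation!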

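Why are regular expressions important?

You may have guessed it from the first paragraph of this guide: regular expressions are useful whenever you have to deal with strings, from a basic renaming of a set of similarly named variables in source code to data preprocessing. Regular expressions usually offer a concise way of expressing whatever you want to find. For example, if you want to parse a form and look for the year someone might have been born in, you could use something like (19|20)[0-9][0-9], which matches a four-digit year starting with 19 or 20. This is an example of a regular expression!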

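Prerequisites

This guide does not assume any prior knowledge. The examples are written in Python, but knowledge of the programming language is neither assumed nor required. You are welcome to read the guide in your browser, or to download it and run the examples and play around with them.

The simplest regular expressions that can be written are made up only of regular characters. If you want to find all occurrences of the word "Virgilio" in a text, you can write the regex Virgilio. In this regex, no character does anything special or different. In fact, this regex is just a normal word. That is fine; after all, regexes are strings!

If you are given the text "Project Virgilio is great", you can use your Virgilio regex to find the occurrence of the word "Virgilio". However, if the text is "Project virgilio is great", then your regex will not work, because regexes are case-sensitive by default and so everything has to match exactly. We say that Virgilio matches the character sequence "Virgilio" literally.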

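Using Python re

To check that our regexes work well, and to give you the chance to experiment directly, we will use Python's re module for working with regular expressions. To use the re module we first import it, then define a regex, and then use the search() function on a string. It is that simple: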

import re

regex = "Virgilio"
str1 = "Project Virgilio is great"
str2 = "Project virgilio is great"

if re.search(regex, str1):
    print("'{}' is in '{}'".format(regex, str1))
else:
    print("'{}' is not in '{}'".format(regex, str1))
    
if re.search(regex, str2):
    print("'{}' is in '{}'".format(regex, str2))
else:
    print("'{}' is not in '{}'".format(regex, str2))

'Virgilio' is in 'Project Virgilio is great'
'Virgilio' is not in 'Project virgilio is great'

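The re.search(regex, string) function takes a regex as its first argument and then searches the string given as the second argument for any match. However, the return value of the function is not a boolean but a match object: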

print(re.search(regex, str1))

<re.Match object; span=(8, 16), match='Virgilio'>

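A match object holds relevant information about the match that was found: the start and end positions, the string that was matched, and even some other things for more complex regexes.

We can see that in this case the match is exactly the same as the regex, so the match information inside the match object may look irrelevant... but that changes as soon as we introduce options or repetitions into our regexes.

If no match is found, the search() function returns None: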

print(re.search(regex, str2))

None

Whenever the result is not None, we can save the returned match object and use it to extract all the information we need!

m = re.search(regex, str1)
if m is not None:
    print("The match started at pos {} and ended at pos {}".format(m.start(), m.end()))
    print("Or with tuple notation, the match is at {}".format(m.span()))
    print("And btw, the actual string matched was '{}'".format(m.group()))

The match started at pos 8 and ended at pos 16
Or with tuple notation, the match is at (8, 16)
And btw, the actual string matched was 'Virgilio'

Now you should try your own literal regexes, both ones that match and ones that fail to match. Here are three examples of mine:

m1 = re.search("regex", "This guide is about regexes")
if m1 is not None:
    print("The match is at {}\n".format(m1.span()))

m2 = re.search("abc", "The alphabet goes 'abdefghij...'")
if m2 is None:
    print("Woops, did I just got the alphabet wrong..?\n")
    
s = "aaaaa aaaaaa a aaa"
m3 = re.search("a", s)
if m3 is not None:
    print("I just matched '{}' inside '{}'".format(m3.group(), s))

The match is at (20, 25)

Woops, did I just got the alphabet wrong..?

I just matched 'a' inside 'aaaaa aaaaaa a aaa'

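Searching π

π = 3.1415..., right? Well, what comes after the dots? An infinite sequence of digits, right? Could it be that your date of birth shows up in the first million digits of π? Well, we can use a regex to find out! Change the regex variable below to look up your date of birth, or any number you like, in the first million digits of π!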

pifile = "regex-bin/pi.txt"
regex = ""  # define your regex to look your favourite number up

with open(pifile, "r") as f:
    pistr = f.read()  # pistr is a string that contains 1M digits of pi
    
## search for your number here

To search the first 100 million digits of π (or 200 million, I have not really used it) you can check out this website.

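Matching options

We just saw a very simple regex that tries to find the word "Virgilio" in a text, but we also saw that it has no flexibility at all: it cannot even cope with someone forgetting to capitalize the name and spelling it "virgilio".

To prevent problems like this, regexes can be written in a way that handles different possibilities. In our case, we want the first letter to be either "V" or "v", followed by "irgilio".

To handle different possibilities we use the character |. For example, V|v matches the letter vee, regardless of its case: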

v = "v"
V = "V"
regex = "v|V"
if re.search(regex, v):
    print("small v found")
if re.search(regex, V):
    print("big V found")

small v found
big V found

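Now we can concatenate the regex for the first letter with the regex irgilio (for the rest of the name) to get a regex that matches the name Virgilio, regardless of the case of its first letter: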

virgilio = "virgilio"
Virgilio = "Virgilio"
regex = "(V|v)irgilio"
if re.search(regex, virgilio):
    print("virgilio found!")
if re.search(regex, Virgilio):
    print("Virgilio found!")

virgilio found!
Virgilio found!

Note that we wrote the regex with parentheses: (V|v)irgilio

If we wrote just V|virgilio, then the regex would match either "V" or "virgilio", rather than "Virgilio" or "virgilio":

regex = "V|virgilio"
print(re.search(regex, "This sentence only has a big V"))

<re.Match object; span=(29, 30), match='V'>

So we really do need the parentheses around (V|v). With them, it works as expected!

regex = "(V|v)irgilio"
print(re.search(regex, "The name of the project is virgilio, but with a big V!"))
print(re.search(regex, "This sentence only has a big V"))

<re.Match object; span=(27, 35), match='virgilio'>
None

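Maybe you did not even notice, but something else happened here! We used |, ( and ), none of which appear in the word "virgilio", and yet our regex (V|v)irgilio matched it. That is because these three characters have a special meaning in the regex world and are therefore not interpreted literally, unlike any of the letters in irgilio.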
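One way to convince yourself that |, ( and ) act as operators rather than as literal characters is to search for the regex inside its own source text. This is only a small illustrative check; the variable names are made up for this sketch:

```python
import re

regex = "(V|v)irgilio"

# The pattern finds the name, even though the name has no '(', '|' or ')'.
in_name = re.search(regex, "virgilio")

# The pattern does not find itself: in the text "(V|v)irgilio" no 'V' or 'v'
# is directly followed by "irgilio".
in_own_text = re.search(regex, "(V|v)irgilio")

print(in_name is not None)   # True
print(in_own_text is None)   # True
```

If the three characters were read literally, the second search would succeed and the first one would fail.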

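Virgilio or Virgil?

Here are a couple of paragraphs from the Wikipedia article on Virgil:

Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called Virgil or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]
Virgil is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. Virgil's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which Virgil appears as Dante's guide through Hell and Purgatory.

"Virgilio" is the Italian form of "Virgil", and in the paragraphs variable below I edited the text to use the Italian version instead of the English one. I want you to put it back!

You may want to look up while loops in Python, [string indexing](https://www.digitalocean.com/community/tutorials/how-to-index-and-slice-strings-in-python-3) and string concatenation. The idea is that you find a match, break the string into the part before the match and the part after the match, and then glue those two together with Virgil in between.

Note that a plain string replacement would probably be faster and easier, but that would defeat the purpose of this exercise. Once you have fixed everything, print the final result to make sure you fixed every occurrence of the name.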

paragraphs = \
"""Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called virgilio or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]

Virgilio is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. virgilio's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which virgilio appears as Dante's guide through Hell and Purgatory."""

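Matching repetition

Sometimes we want to find patterns that have bits that can repeat. For example, people make the sound "awww" or "owww" when they see something cute, like a baby. But the number of "w" I used there was completely arbitrary! If the baby is really cute, someone might write "awwwwwwwwwww". So how can I write a regex that matches "aww" and "oww", but with an arbitrary number of "w" characters?

I will illustrate several ways of capturing repetition by testing regexes against the following strings: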

"awww" (3 letters "w")
"awwww" (4 letters "w")
"awwwwwww" (7 letters "w")
"awwwwwwwwwwwwwwww" (16 letters "w")
"aw" (1 letter "w")
"a" (0 letters "w")

cute_strings = [
    "awww",
    "awwww",
    "awwwwwww",
    "awwwwwwwwwwwwwwww",
    "aw",
    "a"
]

def match_cute_strings(regex):
    """Takes a regex, prints matches and non-matches"""
    for s in cute_strings:
        m = re.search(regex, s)
        if m:
            print("match: {}".format(s))
        else:
            print("non match: {}".format(s))

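At least once

If I want to match all strings that contain at least one "w", I can use the character +. A + means that we want to find one or more repetitions of whatever is to its left. For example, the regex a+ matches any string that has at least one "a".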

regex = "aw+"
match_cute_strings(regex)

match: awww
match: awwww
match: awwwwwww
match: awwwwwwwwwwwwwwww
match: aw
non match: a

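Any number of times

If I want to match all strings that contain any number of letters "w", I can use the character *. The character * means match any number of repetitions of whatever is to its left, even 0 repetitions! So the regex a* matches the empty string "", because the empty string has 0 repetitions of the letter "a".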

regex = "aw*"
match_cute_strings(regex)

match: awww
match: awwww
match: awwwwwww
match: awwwwwwwwwwwwwwww
match: aw
match: a

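A specific number of times

If I want to match strings that contain a particular piece repeated a specific number of times, I can use the {n} notation, where n is replaced by the number of repetitions I want. For example, a{3} matches the string "aaa" but not the string "aa".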

regex = "aw{3}"
match_cute_strings(regex)

match: awww
match: awwww
match: awwwwwww
match: awwwwwwwwwwwwwwww
non match: aw
non match: a

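Wait a minute, why did the pattern aw{3} match the longer expressions of cuteness, like "awwww" or "awwwwwww"? Because regexes look for substrings that match the pattern. Our pattern is awww (if I write the w{3} out explicitly), and the string awwww has that substring, just like the string awwwwwww has it, and so does the longer version with 16 letters "w". If we wanted to exclude the strings "awwww", "awwwwwww" and "awwwwwwwwwwwwwwww", we would have to fix our regex. A better example of how {n} works is to consider expressions of amazement, like "wow", "woow" and "wooooooooooooow", instead of expressions of cuteness. Let us define some expressions of amazement: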

"wow"
"woow"
"wooow"
"woooow"
"wooooooooow"

Now we test our {3} pattern.

wow_strings = [
    "wow",
    "woow",
    "wooow",
    "woooow",
    "wooooooooow"
]

def match_wow_strings(regex):
    """Takes a regex, prints matches and non-matches"""
    for s in wow_strings:
        m = re.search(regex, s)
        if m:
            print("match: {}".format(s))
        else:
            print("non match: {}".format(s))

regex = "wo{3}w"
match_wow_strings(regex)

non match: wow
non match: woow
match: wooow
non match: woooow
non match: wooooooooow
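Going back to the aw{3} puzzle from before: if the goal is to accept only strings that are exactly "awww", one option (certainly not the only one) is to ask for a match of the whole string with re.fullmatch, instead of a match anywhere in the string with re.search. A minimal sketch, with a shortened list of candidate strings chosen just for this example:

```python
import re

regex = "aw{3}"
candidates = ["awww", "awwww", "aw"]

# re.search looks for the pattern anywhere inside the string...
found_anywhere = [s for s in candidates if re.search(regex, s)]
# ...while re.fullmatch only succeeds if the whole string fits the pattern.
found_exactly = [s for s in candidates if re.fullmatch(regex, s)]

print(found_anywhere)  # ['awww', 'awwww']
print(found_exactly)   # ['awww']
```

The wo{3}w pattern above did not need this because the final w in the pattern already limits how many letters "o" can sit between the two "w".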

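Between n and m times

Expressing amazement with exactly three "o" is fine, but people might also use two or four. How do we capture a variable number of letters, but within a range? Say I only want to capture the versions of "wow" that have between 2 and 4 letters "o". I can do that with {2,4}.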

regex = "wo{2,4}w"
match_wow_strings(regex)

non match: wow
match: woow
match: wooow
match: woooow
non match: wooooooooow

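Up to n times or at least m times

Now that we are playing with the kinds of repetition we might want, we may of course want to say that we want no more than n repetitions, which you can do with {,n}, or that we want at least m repetitions, which you can do with {m,}.

Indeed, take a look at these regexes: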

regex = "wo{,4}w" # should not match strings with more than 4 o's
match_wow_strings(regex)

match: wow
match: woow
match: wooow
match: woooow
non match: wooooooooow

regex = "wo{3,}w" # should not match strings with less than 3 o's
match_wow_strings(regex)

non match: wow
non match: woow
match: wooow
match: woooow
match: wooooooooow
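The operators seen so far can also be spelled with braces: + behaves like {1,}, * behaves like {0,}, and {,4} behaves like {0,4}. The sketch below checks this on a handful of strings; the list and the pairs were picked only for this illustration, and agreeing on a few strings is of course a sanity check rather than a proof:

```python
import re

wow_strings = ["ww", "wow", "woow", "wooow", "woooow", "wooooooooow"]

# Each pair holds a shorthand and the brace spelling assumed to be equivalent.
pairs = [
    ("wo+w", "wo{1,}w"),
    ("wo*w", "wo{0,}w"),
    ("wo{,4}w", "wo{0,4}w"),
]

for short, spelled_out in pairs:
    # Compare match / no match for both spellings on every test string.
    same = all(
        bool(re.search(short, s)) == bool(re.search(spelled_out, s))
        for s in wow_strings
    )
    print("{} vs {}: {}".format(short, spelled_out, "same" if same else "different"))
```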

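To be or not to be

The ? operator makes whatever is to its left optional: it matches zero or one occurrence of it. Below, (io)? lets one regex match both "virgil" and "virgilio".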

regex = "(V|v)irgil(io)?"
names = ["virgil", "Virgil", "virgilio", "Virgilio"]
for name in names:
    m = re.search(regex, name)
    if m:
        print("The name {} was matched!".format(name))

The name virgil was matched!
The name Virgil was matched!
The name virgilio was matched!
The name Virgilio was matched!

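Greedy

The +, ?, * and {,} operators are all greedy. What does this mean? It means they match as much as they can. This is their default behaviour, as opposed to stopping as soon as the regex is satisfied. To better illustrate what I mean, let us take another look at the information contained in the match objects we have been dealing with: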

regex = "a+"
s = "aaa"
m = re.search(regex, s)
print(m)

<re.Match object; span=(0, 3), match='aaa'>

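Note the part of the printed information that says match='aaa'. The function m.group() tells me the actual string that the regex matched, and in this case it is "aaa". The regex I wrote, a+, matches one or more letters "a". If I use that regex on a string and get a match, how could I know how many "a" were matched if I did not have access to that kind of information?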

print(m.group())

aaa

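So let us verify that the operators I mentioned are, in fact, all greedy, that is, that they all match as many characters as possible.

Below we see that, given a string of 30 letters "a":

the pattern a? matches 1 "a", which is as many as it can
the pattern a+ matches 30 "a", which is as many as it can
the pattern a* also matches 30 "a"
the pattern a{5,10} matches 10 "a", which is the limit we imposed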

s = "a"*30
print(re.search("a?", s).group())
print(re.search("a+", s).group())
print(re.search("a*", s).group())
print(re.search("a{5,10}", s).group())

a
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaa

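If we do not want our operators to be greedy, we just put a ? after them. So the following regexes are not greedy:

- The pattern `a??` will **not** match any characters, much like `a*?`, because now they aim to match as little as possible, and a match of length 0 is the shortest match possible!
- The pattern `a+?` matches only 1 "a"
- The pattern `a{5,10}?` matches only 5 "a"

We can easily confirm what I just said by running the code below. Notice that I now print things in a different way, because otherwise we would not be able to see that the patterns a?? and a*? matched the empty string.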

s = "a"*30
print("'{}'".format(re.search("a??", s).group()))
print("'{}'".format(re.search("a+?", s).group()))
print("'{}'".format(re.search("a*?", s).group()))
print("'{}'".format(re.search("a{5,10}?", s).group()))

''
'a'
''
'aaaaa'
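Greediness matters most when a pattern is surrounded by delimiters. The sketch below pulls quoted words out of a sentence; it uses the dot, which matches any character and is covered later in this guide, and the sample text is made up for the example:

```python
import re

text = "say 'hi' and 'bye'"

# Greedy: .+ grabs as much as it can, so the match runs up to the LAST quote.
greedy = re.search("'.+'", text).group()
# Non-greedy: .+? grabs as little as it can, so the match stops at the NEXT quote.
lazy = re.search("'.+?'", text).group()

print(greedy)  # 'hi' and 'bye'
print(lazy)    # 'hi'
```

Which of the two is wanted depends on the task; neither behaviour is a bug.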

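Removing excess whitespace

Now that we know about repetitions, let me tell you about the sub function, which we will use to go through a piece of text and remove all the extra whitespace in it. Calling re.sub(regex, rep, string) uses the given regex on the given string and, wherever there is a match, removes the match and puts rep in its place.

For example, I can use it to replace every English or Italian variant of the name with the standard version, Virgilio: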

s = "Virgilio has many names, like virgil, virgilio, Virgil, Vergil, or even vergil."
regex = "(V|v)(e|i)rgil(io)?"

print(
    re.sub(regex, "Virgilio", s)
)

Virgilio has many names, like Virgilio, Virgilio, Virgilio, Virgilio, or even Virgilio.

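Now it is your turn. I am going to give you this sentence as input, and your job is to fix the whitespace in it. When you are done, save the result in a string named s, and check whether s.count("  ") is equal to 0.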

weird_text = "Now   it  is your   turn.  I am     going  to give   you this    sentence as        input, and   your  job    is to      fix the     whitespace         in it. When you    are  done,    save the    result in a  string  named   `s`, and   check    if  `s.count(\"  \")` is   equal   to    0  or not."
regex = ""  # put your regex here

# substitute the extra whitespace here
# save the result in 's'

# this print should be 0
print(s.count("  "))

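Character groups

So far we have been writing simple regexes that match a few words, a few names, and things like that. Now we have a different plan. We will write a regex that matches US phone numbers, which we will assume are in the format xxx-xxx-xxxx. The first three digits are the area code, but we will not care whether the area code actually makes sense. So how do we match this?

In fact, how do I match just the first digit? It can be any digit from 0 to 9, so should I write (0|1|2|3|4|5|6|7|8|9) to match the first digit, and then repeat that? We could indeed do that, and we would get this regex: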

(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){4}

Does this work?

regex = "(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){4}"
numbers = [
    "202-555-0181",
    "202555-0181",
    "202 555 0181",
    "512-555-0191",
    "96-125-3546",
]
for nr in numbers:
    print(re.search(regex, nr))

<re.Match object; span=(0, 12), match='202-555-0181'>
None
None
<re.Match object; span=(0, 12), match='512-555-0191'>
None

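It looks like it works, but surely there must be a better way... and there is! Instead of writing out every single digit like we did, we can write a range of values! In fact, the regex [0-9] matches any digit from 0 to 9. So we can shorten our regex to [0-9]{3}-[0-9]{3}-[0-9]{4}: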

regex = "[0-9]{3}-[0-9]{3}-[0-9]{4}"
numbers = [
    "202-555-0181",
    "202555-0181",
    "202 555 0181",
    "512-555-0191",
    "96-125-3546",
]
for nr in numbers:
    print(re.search(regex, nr))

<re.Match object; span=(0, 12), match='202-555-0181'>
None
None
<re.Match object; span=(0, 12), match='512-555-0191'>
None

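The magic here is done by the [], which denotes a character group. The way [] works is that the regex tries to match any one of the things inside it, and it just so happens that 0-9 is a shorter way of listing all the digits. Of course you could also write [0123456789]{3}-[0123456789]{3}-[0123456789]{4}, which is slightly shorter than our first attempt but still pretty bad. Similarly to 0-9, we have a-z and A-Z, which go through all the letters of the alphabet.

You can also start and end in different places; for example, c-o can be used to match words that only use letters between "c" and "o", like "hello":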

regex = "[c-o]+"
print(re.search(regex, "hello"))
print(re.search(regex, "rice"))

<re.Match object; span=(0, 5), match='hello'>
<re.Match object; span=(1, 4), match='ice'>

With these character groups we can rewrite our Virgilio regex into something slightly shorter, going from (V|v)(e|i)rgil(io)? to [Vv][ie]rgil(io)?.

s = "Virgilio has many names, like virgil, virgilio, Virgil, Vergil, or even vergil."
regex = "[Vv][ie]rgil(io)?"

print(
    re.sub(regex, "Virgilio", s)
)

Virgilio has many names, like Virgilio, Virgilio, Virgilio, Virgilio, or even Virgilio.

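Going back to the [c-o]+ example, notice once more that the regex matched the ice in rice, because "r" is not in the allowed range of letters but "ice" is.

A character group is the square brackets [] together with whatever is inside them. Also note that the special characters we have been using lose their special meaning inside a character group! So [()?+*{}] will actually match any one of those characters: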

regex = "[()?+*{}]"
print(re.search(regex, "Did I just ask a question?"))

<re.Match object; span=(25, 26), match='?'>

One final note on character groups: if they start with ^, then we are actually saying "use anything except what is inside":

regex = "[^c-o]+"
print(re.search(regex, "hello"))
print(re.search(regex, "rice"))

None
<re.Match object; span=(0, 1), match='r'>

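Phone numbers v1

Now that you know how to use character groups to denote ranges, write a regex that matches US phone numbers in the format xxx-xxx-xxxx. Not only that, but you also have to cope with the fact that the numbers may or may not be preceded by a country code, which you can assume looks like "+1" or "001". The country code may be separated from the rest of the number by a space or a dash.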

regex = ""  # write your regex here
matches = [  # you should be able to match those
    "202-555-0181",
    "001 202-555-0181",
    "+1-512-555-0191"
]
non_matches = [  # for now, none of these should be matched
    "202555-0181",
    "96-125-3546",
    "(+1)5125550191"
]
for s in matches:
    print(re.search(regex, s))
for s in non_matches:
    print(re.search(regex, s))

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>

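Special characters

Time to step things up a bit! We have already seen some characters with special meaning, and now I will introduce a few more. I will start by listing them and then explain them in more detail:

. is used to match any character, except for a newline
^ is used to match the start of a string
$ is used to match the end of a string
\d is used to match any digit
\w is used to match any alphanumeric character
\s is used to match any kind of whitespace
\ is used to remove the special meaning of a character

The dot `.`

The . can be used in a regex to capture any character that might have been used there, as long as we stay on the same line. That is, the only thing . does not match is a line break in the text. Imagine the pattern is d.ck. Then the pattern will match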

"duck"

but it will not match

"d
ck"

because there is a line break in the middle of the string.
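The behaviour of d.ck described above can be checked directly. A short sketch, where the three test strings are chosen only for illustration:

```python
import re

regex = "d.ck"

same_line = re.search(regex, "duck")
with_space = re.search(regex, "d ck")     # a space is a character like any other
across_lines = re.search(regex, "d\nck")  # "\n" is the line break

print(same_line)     # a match object for 'duck'
print(with_space)    # a match object for 'd ck'
print(across_lines)  # None
```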

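The caret `^`

If we use ^ at the beginning of a regex, then we only care about matches at the beginning of the string. That is, ^wow only matches strings that start with "wow":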

regex = "^wow"

print(re.search(regex, "wow, this is awesome"))
print(re.search(regex, "this is awesome, wow"))

<re.Match object; span=(0, 3), match='wow'>
None

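Recall that ^ inside a character group can also mean "anything except what is in this group", so the regex [^d]uck matches any character other than "d" followed by uck: it finds a match in "luck" or "truck", but not in "duck". If the caret ^ appears inside a character group [] but is not the first character, then it has no special meaning and simply stands for the character itself. This means the regex [()^{}] looks for a match with any one of the characters listed: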

regex = "[()^{}]"
print(re.search(regex, "^"))
print(re.search(regex, "("))
print(re.search(regex, "}"))

<re.Match object; span=(0, 1), match='^'>
<re.Match object; span=(0, 1), match='('>
<re.Match object; span=(0, 1), match='}'>

The dollar sign `$`

Contrary to the caret ^, the dollar sign only matches at the end of the string!

regex = "wow$"

print(re.search(regex, "wow, this is awesome"))
print(re.search(regex, "this is awesome, wow"))

None
<re.Match object; span=(17, 20), match='wow'>

Combining ^ with $ means we want the whole string to match our pattern. For example, ^[a-zA-Z ]*$ checks whether our string contains only letters and spaces and nothing else:

regex = "^[a-zA-Z ]*$"

s1 = "this is a sentence with only letters and spaces"
s2 = "this sentence has 1 number"
s3 = "this one has punctuation..."

print(re.search(regex, s1))
print(re.search(regex, s2))
print(re.search(regex, s3))

<re.Match object; span=(0, 47), match='this is a sentence with only letters and spaces'>
None
None
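One detail worth knowing when the text comes from a file: in Python, $ also matches just before a newline that ends the string, so a trailing line break does not prevent a match. When the whole string really has to fit the pattern, re.fullmatch is stricter. A small sketch, with a sample line invented for the example:

```python
import re

line = "this is awesome, wow\n"  # e.g. a line read from a file

# '$' also matches just before a newline that ends the string.
with_dollar = re.search("wow$", line)

# re.fullmatch must consume the entire string, newline included.
strict = re.fullmatch("[a-z, ]*", line)
stripped = re.fullmatch("[a-z, ]*", line.rstrip("\n"))

print(with_dollar.span())    # (17, 20)
print(strict)                # None
print(stripped is not None)  # True
```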

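The character groups `\d`, `\w` and `\s`

Whenever you see a backslash followed by a letter, it probably means something special is being matched. These three special "characters" are shorthand for some character groups []. For example, \d is the same as [0-9] for plain ASCII text (in Python 3 string patterns it also matches digits from other scripts). \w stands for any alphanumeric character (letters, digits and also _), while \s stands for any whitespace character (like the space " ", tabs, newlines, etc.).

All three of these special characters can be upper-cased. If they are, they mean exactly the opposite! So \D means "any character except a digit", \W means "any character except an alphanumeric one", and \S means "any character except whitespace".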

regex = r"\D+"  # raw string, so Python leaves the backslash alone
s = "these are some words"
print(re.findall(regex, s))

['these are some words']

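On top of that, these special characters can be used inside character groups; for example, [abc\d] matches any digit and the letters "a", "b" and "c". If the caret ^ is used, then whatever the special character stands for is excluded. For example, since [\d] matches any digit, [^\d] matches anything that is not a digit.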
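Both character groups can be tried out with re.findall, which returns every non-overlapping match as a list. The sample string is made up for this sketch, and the patterns are written as raw strings (with the r prefix) so that Python passes the backslash to the regex engine untouched:

```python
import re

s = "cab 42 xyz"

# Digits plus the letters a, b and c.
digits_or_abc = re.findall(r"[abc\d]+", s)
# Anything that is not a digit.
non_digits = re.findall(r"[^\d]+", s)

print(digits_or_abc)  # ['cab', '42']
print(non_digits)     # ['cab ', ' xyz']
```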

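The backslash `\`

We have already seen that a backslash before certain letters gives them a special meaning... well, a backslash before a special character also takes its special meaning away! So if you want to match a backslash, you can use \\. If you want to match any of the other special characters we have seen, you can put a \ before them, like \+ to match a plus sign. The next regex can be used to match addition expressions like "16 + 6":

regex = r"[\d]+ ?\+ ?[\d]+"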
add1 = "16 + 6"
add2 = "4325+2"
add3 = "4+ 564"
mult1 = "56 * 2"

print(re.search(regex, add1))
print(re.search(regex, add2))
print(re.search(regex, add3))
print(re.search(regex, mult1))

<re.Match object; span=(0, 6), match='16 + 6'>
<re.Match object; span=(0, 6), match='4325+2'>
<re.Match object; span=(0, 6), match='4+ 564'>
None
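Backslashes are also special in ordinary Python string literals, which is why regexes are commonly written as raw strings with the r prefix. The sketch below shows the difference and also uses re.escape, a function from the re module that adds the backslashes needed to match a piece of text literally. The sample values are chosen only for illustration:

```python
import re

# In a normal literal "\n" is one character (a newline);
# in a raw literal r"\n" it is two: a backslash and the letter n.
print(len("\n"), len(r"\n"))  # 1 2

# Matching one literal backslash needs the regex \\ (written here as a raw string).
path = "C:\\temp"  # the text C:\temp
backslash = re.search(r"\\", path)
print(backslash.span())  # (2, 3)

# re.escape builds a regex that matches the given text literally.
escaped = re.escape("1+1")
print(escaped)  # 1\+1
literal = re.search(escaped, "1+1=2")
print(literal.group())  # 1+1
```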

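Phone numbers v2

Now I invite you to go back to Phone numbers v1 and rewrite your regex to include some of the new special characters you did not know before!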

regex = ""  # write your regex here
matches = [  # you should be able to match those
    "202-555-0181",
    "001 202-555-0181",
    "+1-512-555-0191"
]
non_matches = [  # for now, none of these should be matched
    "202555-0181",
    "96-125-3546",
    "(+1)5125550191"
]
for s in matches:
    print(re.search(regex, s))
for s in non_matches:
    print(re.search(regex, s))

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 0), match=''>

Grouping

So far, whenever we used a regex to match a string, we could retrieve the whole match by calling the .group() function on the match object:

regex = "my name? is"

m = re.search(regex, "my nam is Virgilio")
if m is not None:
    print(m.group())

my nam is

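Suppose we are dealing with phone numbers again, and we want to find phone numbers in a large text. But after that, we also want to extract the country the number is from. How could we do that..? Well, we can use a regex to match the phone numbers and then a second regex to extract the country code, right? (Let us assume the digits of the phone number are written in sequence, with no spaces or "-" separating them.)

regex_number = r"((00|[+])\d{1,3}[ -])\d{8,12}"
regex_code = r"((00|[+])\d{1,3})"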
matches = [  # you should be able to match those
    "+351 2025550181",
    "001 2025550181",
    "+1-5125550191",
    "0048 123456789"
]

for s in matches:
    m = re.search(regex_number, s)  # match the phone number
    if m is not None:
        phone_number = m.group()    # extract the phone number
        code = re.search(regex_code, phone_number)  # match the country code
        print("The country code is: {}".format(code.group()))

The country code is: +351
The country code is: 001
The country code is: +1
The country code is: 0048

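But not only is this repetitive, since I just copied the beginning of regex_number into regex_code, it also becomes very cumbersome if I want to retrieve several different parts of my match. For this reason, regexes have a feature called groups. By grouping parts of a regex you can do things like apply the repetition operators to them and, later, retrieve their information.

To create a group, you just need to use parentheses (). For example, the regex (ab)+ looks for matches of the form "ab", "abab", "ababab", and so on.

We also used grouping at the beginning to create a regex that matched both "Virgilio" and "virgilio", when we wrote (V|v)irgilio.

Now for the really important part! We can use grouping to retrieve parts of a match, and we do that with the .group() function! Every pair of () defines a group, and we can then use .group(i) to retrieve group i. Note that group 0 is always the whole match, and the remaining groups are numbered by their opening parenthesis, counting from the left!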

regex_with_grouping = "(abc) (de(fg)hi)"
m = re.search(regex_with_grouping, "abc defghi jklm n opq")
print(m.group())
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))
print(m.groups())

abc defghi
abc defghi
abc
defghi
fg
('abc', 'defghi', 'fg')

Note that match.group() and match.group(0) are the same thing. Also note that the function match.groups() returns all the groups in a tuple!
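The (ab)+ regex mentioned earlier shows what happens when a group is repeated: the whole match covers every repetition, but the group itself only remembers the last one. A short sketch, with a sample string invented for the example:

```python
import re

m = re.search("(ab)+", "xx ababab yy")

print(m.group())   # ababab  (the whole match)
print(m.group(1))  # ab      (only the last repetition of the group)
print(m.span())    # (3, 9)
print(m.span(1))   # (7, 9)
```

Passing a group number to span() works the same way as passing it to group().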

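Phone numbers v3

Using what you have learned so far, write a regex that matches phone numbers with different country codes. Assume the following:

the country code starts with "00" or "+", followed by one to three digits
the phone number has between 8 and 12 digits
the phone number and the country code are separated by a space " " or a hyphen "-"

Have your code look for phone numbers in the text I provide next, and have it print the different country codes it finds.

You may want to read up on the exact behaviour of re.findall() when the regex has groups in it. You can do that by checking the documentation of the re module.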

paragraph = """Hello, I am Virgilio and I am from Italy.
If phones were a thing when I was alive, my number would've probably been 0039 3123456789.
I would also love to get a house with 3 floors and something like +1 000 square meters.
Now that we are at it, I can also tell you that the number 0039 3135313531 would have suited Leo da Vinci very well...
And come to think of it, someone told me that Socrates had dibs on +30-2111112222"""
# you should find 3 phone numbers
# and you should not be fooled by the other numbers that show up in the text
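As a hint for the exercise above, here is how re.findall() changes its return value depending on the number of groups in the regex, shown on a toy string that has nothing to do with phone numbers. re.finditer() is included for comparison, since it yields match objects instead:

```python
import re

text = "a1 b22 c333"

# No groups: a list with the full matches.
no_groups = re.findall(r"[a-z]\d+", text)
# One group: a list with what that group captured.
one_group = re.findall(r"[a-z](\d+)", text)
# Several groups: a list of tuples, one entry per group.
two_groups = re.findall(r"([a-z])(\d+)", text)
# finditer yields match objects, so the full match is still available.
full_matches = [m.group() for m in re.finditer(r"([a-z])(\d+)", text)]

print(no_groups)     # ['a1', 'b22', 'c333']
print(one_group)     # ['1', '22', '333']
print(two_groups)    # [('a', '1'), ('b', '22'), ('c', '333')]
print(full_matches)  # ['a1', 'b22', 'c333']
```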

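A toy project with regexes

For a toy project that is far from trivial, you can try to imitate what I did here. If you follow that link, you will find a piece of code that takes a regex and prints all the strings that the given regex matches.

Let me give you a couple of examples of how it works: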

import sys
sys.path.append("./regex-bin")
import regexPrinter

def get_iter(regex):
    return regexPrinter.printRegex(regex).print()

def printall(regex):
    for poss_match in get_iter(regex):
        print(poss_match)

regex = "V|virgilio"
printall(regex)
print("-"*30)
regex = "wo+w"
printall(regex)
print("-"*30)
# notice that for some reason, dumb me used {n:m} instead of {n,m}
# also note that I only implemented {n,m}, and not {n,} nor {,m} nor {n}
# also note that this does not support nor \d nor [0-9]
regex = "((00|[+])1[ -])?[0123456789]{3:3}"
printall(regex)

Note that the code guards against infinite patterns, which are signalled with "...".

printall("this is infinite!+")

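If you are completely new to this sort of thing, it may look utterly impossible... but it is not, because I am a regular person and I managed to do it! So if you really want to, you can do it too! Behind the link you will find a list of all the features I decided to include, which excludes, for example, \d.

I was only able to do it the way I did because I had gone through some (not all) of the blog posts in this amazing series.

Maybe you can implement a smaller subset of the features without too much trouble? The point is that you can only print the strings a regex matches if you know how regexes work. Try starting by implementing only literal matching and the | and ? operators. Can you now include grouping (), so that (ab)? works as expected? Can you add []? What about + and *? Or maybe start with {n,m} and write ?, + and * as {0,1}, {1,} and {0,} respectively.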
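One possible starting point for that first step is sketched below. It is not the regexPrinter code from the link: enumerate_matches is a hypothetical helper written for this illustration, and it assumes patterns made only of literal characters, | and ? (where ? applies to the single character before it), with no groups, no [] and no escapes.

```python
import re
from itertools import product

def enumerate_matches(pattern):
    """Yield the strings matched by a pattern made only of literals, | and ?.

    Hypothetical helper for illustration: no groups, no [], no escapes.
    """
    for alternative in pattern.split("|"):
        # One entry per literal: [c] if required, ["", c] if followed by ?.
        choices = []
        for ch in alternative:
            if ch == "?":
                if not choices:
                    raise ValueError("nothing before '?'")
                choices[-1] = ["", choices[-1][-1]]
            else:
                choices.append([ch])
        # Pick one option per position, in every possible combination.
        for combo in product(*choices):
            yield "".join(combo)

results = list(enumerate_matches("V|virgilio?"))
print(results)  # ['V', 'virgili', 'virgilio']

# Sanity check against the real engine: every string must match in full.
print(all(re.fullmatch("V|virgilio?", s) for s in results))  # True
```

Splitting on | first mirrors the fact, seen earlier with V|virgilio, that alternation applies to everything on each side of it. Supporting groups would require a real parser instead of str.split.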

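You can also put this project off for a while and dig deeper into the world of regexes. The next section has some extra references and a few websites where you can practise your new knowledge!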

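Further reading

For regexes in Python, you can check the documentation of the re module, as well as this regex HOWTO.

Some good topics to follow up on include, but are not limited to:

- non-capturing groups (and named groups, for Python)
- assertions (lookahead, negative, ...)
- regex compilation and flags (for Python)
- recursive regexes

This interesting website (and this one as well) provides an interface where you can type in a regex and see what it matches in a text. The tool also explains what the regex is doing.

I found some interesting websites with regex exercises. This one has more "basic" exercises, each preceded by an explanation of everything needed to complete it. I suggest you go through it. Hackerrank and regexplay also have some interesting exercises, but those require you to log in in some way.

If you liked this guide and/or it was useful to you, consider starring the Virgilio repository and sharing it with your friends!

This guide was brought to you by the editor of the Mathspp Blog, RojerGS.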

Searching π (solved)

pifile = "regex-bin/pi.txt"
regex = "9876"  # define your regex to look your favourite number up

with open(pifile, "r") as f:
    pistr = f.read()  # pistr is a string that contains 1M digits of pi
    
## search for your number here
m = re.search(regex, pistr)
if m:
    print("Found the number '{}' at positions {}".format(regex, m.span()))
else:
    print("Sorry, the first million digits of pi can't help you with that...")

Virgilio or Virgil? (solved)

paragraphs = \
"""Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called virgilio or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]

Virgilio is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. virgilio's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which virgilio appears as Dante's guide through Hell and Purgatory."""

regex = "(V|v)irgilio"
parsed_str = paragraphs
m = re.search(regex, parsed_str)
while m is not None:
    parsed_str = parsed_str[:m.start()] + "Virgil" + parsed_str[m.end():]
    m = re.search(regex, parsed_str)

print(parsed_str)

Removing excess whitespace (solved)

weird_text = "Now   it  is your   turn.  I am     going  to give   you this    sentence as        input, and   your  job    is to      fix the     whitespace         in it. When you    are  done,    save the    result in a  string  named   `s`, and   check    if  `s.count(\"  \")` is   equal   to    0  or not."
regex = " +"  # put your regex here
# there are several possible solutions, I chose this one

# substitute the extra whitespace here
s = re.sub(regex, " ", weird_text)

# this print should be 0
print(s.count("  "))
print(s)

Phone numbers v1 (solved)

regex = "((00|[+])1[ -])?[0-9]{3}-[0-9]{3}-[0-9]{4}"  # write your regex here
matches = [  # you should be able to match those
    "202-555-0181",
    "001 202-555-0181",
    "+1-512-555-0191"
]
non_matches = [  # for now, none of these should be matched
    "202555-0181",
    "96-125-3546",
    "(+1)5125550191"
]
for s in matches:
    print(re.search(regex, s))
for s in non_matches:
    print(re.search(regex, s))

`search` using `match` (solved)

def my_search(regex, string):
    # try to match the pattern at the start of each suffix of the string
    while string:
        if re.match(regex, string):
            return True
        string = string[1:]
    # the string is now empty: check if the pattern matches the empty string
    return re.match(regex, string) is not None

regex = "[0-9]{2,4}"

# your function should be able to match in all these strings
string1 = "1984 was already some years ago."
print(my_search(regex, string1))
string2 = "There is also a book whose title is '1984', but the story isn't set in the year of 1984."
print(my_search(regex, string2))
string3 = "Sometimes people write '84 for short."
print(my_search(regex, string3))

# your function should also match with this regex and this string
regex = "a*"
string = ""
print(my_search(regex, string))

Counting matches with `findall` (solved)

def count_matches(regex, string):
    return len(re.findall(regex, string))

regex = "wow"

string1 = "wow wow wow" # this should be 3
print(count_matches(regex, string1))
string2 = "wowow" # this should be 1
print(count_matches(regex, string2))
string3 = "wowowow" # this should be 2
print(count_matches(regex, string3))

Phone numbers v2 (solved)

regex = r"((00|[+])1[ -])?\d{3}-\d{3}-\d{4}"  # write your regex here
matches = [  # you should be able to match those
    "202-555-0181",
    "001 202-555-0181",
    "+1-512-555-0191"
]
non_matches = [  # for now, none of these should be matched
    "202555-0181",
    "96-125-3546",
    "(+1)5125550191"
]
for s in matches:
    print(re.search(regex, s))
for s in non_matches:
    print(re.search(regex, s))

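Phone numbers v3 (solved)

For this "problem", one would think of using the .findall() function to find all the matches. When we do that, we do not get a list of match objects; instead we get a list of tuples, where each tuple holds one entry per group in the regex. This is the [documented behaviour of the re.findall() function](https://docs.python.org/3/library/re.html#re.findall).

That is fine, because we really only care about the country code and we can print it easily. If we wanted match objects, the alternative would be to use the re.finditer() function.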

paragraph = """Hello, I am Virgilio and I am from Italy.
If phones were a thing when I was alive, my number would've probably been 0039 3123456789.
I would also love to get a house with 3 floors and something like +1 000 square meters.
Now that we are at it, I can also tell you that the number 0039 3135313531 would have suited Leo da Vinci very well...
And come to think of it, someone told me that Socrates had dibs on +30-2111112222"""
# you should find 3 phone numbers
# and you should not be fooled by the other numbers that show up in the text

regex = r"((00|[+])\d{1,3})[ -]\d{8,12}"
ns = re.findall(regex, paragraph)  # find numbers
for n in ns:
    # n is a tuple with the two groups our regex has
    print(n)
    
for n in re.finditer(regex, paragraph):
    print("The number '{}' has country code: {}".format(n.group(), n.group(1)))

Original: https://github.com/clone95/Virgilio/blob/master/zh-CN/Tools/Regex.ipynb

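This article is part of the Tencent Cloud self-media sharing program and is shared from the author's personal site/blog.

Originally published: 2019.04.10. In case of infringement, contact cloudcommunity@tencent.com for removal.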
