Regular Expressions

MeteoAI

Published 2019-08-19 11:14:50

Collected in the column: MeteoAI

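A regular expression uses a single string to describe and match a whole family of strings that follow some syntactic rule. In many text editors, regular expressions are used to search for and replace text that fits a pattern. A web-scraping engineer can use them to pull text data out of web pages, and a natural language processing engineer can use them to pick out sentences that contain sensitive words. As meteorological engineers, we can use them to process the log files on our servers, or to match model output file names that follow a particular pattern.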

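Suppose we have the following log file:

# Read the whole log file into one string (assuming the file is UTF-8 encoded)
with open('xxx.log.txt', encoding='utf-8') as f:
    log = f.read()
print(log)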

'2018-01-16 09：14：35   reading  EC DATA\n
2018-01-16 10：17：37    reprocess the EC DATA\n
2018-01-17 18：18：38    put into WRF,\n
2018-01-22  16：17：37   extract the grid data to nearest station, merge with actual data, save to Mysql database \n
2018-01-24 17：14：39    extract the data from Mysql and put the station data to CNN-LSTM model\n
2018-01-24 22：12：39    training the  CNN-LSTM model\n
2018-01-25 17：09：22    tuning\n
2018-01-25 17：09：22    save the best model\n
2018-01-26 19：23：55    predict  the wind speed\n
2018-01-27 06：09：45    save and evaluate\n\n\n'

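The dates in log are formatted as yyyy-mm-dd. What should we do if the client suddenly asks us to change every date to mm/dd/yyyy?

This is where regular expressions come in handy. First we match the year-month-day strings and print them to check that the matched dates are correct.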

import re
pattern = r'\d{4}-\d{2}-\d{2}'
print(re.findall(pattern,log))

['2018-01-16',
 '2018-01-16',
 '2018-01-17',
 '2018-01-22',
 '2018-01-24',
 '2018-01-24',
 '2018-01-25',
 '2018-01-25',
 '2018-01-26',
 '2018-01-27']

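Next we split the expression above into year, month, and day groups by adding parentheses, reorder them from the default 1-2-3 to 2-3-1, and use re.sub to substitute the rearranged text.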

pattern_ed=r'(\d{4})-(\d{2})-(\d{2})'
sub_order = r'\2/\3/\1'  # reorder the groups: month/day/year
print(re.sub(pattern_ed,sub_order,log))

'01/16/2018 09：14：35   reading  EC DATA\n
01/16/2018 10：17：37    reprocess the EC DATA\n
01/17/2018 18：18：38    put into WRF,\n
01/22/2018  16：17：37   extract the grid data to nearest station, merge with actual data, save to Mysql database \n
01/24/2018 17：14：39    extract the data from Mysql and put the station data to CNN-LSTM model\n
01/24/2018 22：12：39    training the  CNN-LSTM model\n
01/25/2018 17：09：22    tuning\n
01/25/2018 17：09：22    save the best model\n
01/26/2018 19：23：55    predict  the wind speed\n
01/27/2018 06：09：45    save and evaluate\n\n\n'

In fact, we can also give each group a name:

pattern_ed = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
sub_order = r'\g<month>/\g<day>/\g<year>'
print(re.sub(pattern_ed, sub_order, log))

The result is the same as above.
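As a quick check that numbered and named groups really give the same result, here is a minimal sketch. It assumes a single sample line in the same layout as the log above (with ASCII colons for simplicity); the variable names are illustrative only.

```python
import re

# One sample line standing in for the real log file (an assumption for this sketch).
sample = '2018-01-16 09:14:35   reading  EC DATA\n'

# Numbered groups: \1 = year, \2 = month, \3 = day.
numbered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', sample)

# Named groups: the names document what each part of the date means.
named = re.sub(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    r'\g<month>/\g<day>/\g<year>',
    sample,
)

print(numbered)
print(numbered == named)
```

Named groups cost a few more keystrokes, but they make it harder to mix up the order of the parts when the pattern grows.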

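The example above is only a teaser. Now let's work through the basics of regular expressions properly. This article uses Python's re module to explain how they are used.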

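1. Basic matching

A regular expression is simply the pattern we use when running a search, built from letters and digits [1]. For example, the regular expression d03 expresses one rule: the letter d, followed by 0, followed by 3. The pattern is compared with the input string character by character. Regular expressions are case sensitive, so D03 does not match d03.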

import re
text = 'WRF_d03_hunan_20190608_16:00:00'
regex_1 = 'd03'
regex_2 = 'D03'
print('Matched:', re.findall(regex_1, text))
print('Matched:', re.findall(regex_2, text))

Matched: ['d03']
Matched: []
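If a case-insensitive match is what we actually want, Python's re module offers the re.IGNORECASE flag. The sketch below reuses the file name from the example above; the variable names strict and relaxed are only for illustration.

```python
import re

text = 'WRF_d03_hunan_20190608_16:00:00'

# Without a flag, 'D03' does not match the lower-case 'd03' in the text.
strict = re.findall('D03', text)

# re.IGNORECASE makes the comparison ignore letter case.
relaxed = re.findall('D03', text, flags=re.IGNORECASE)

print(strict)
print(relaxed)
```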

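2. Metacharacters

Regular expressions rely heavily on metacharacters. A metacharacter does not stand for its literal self; it carries a special meaning. Some metacharacters take on a different meaning when written inside square brackets. The metacharacters are introduced below: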

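Metacharacter	Description
$	Matches at the end of the string.
^	Matches at the start of the string.
\	Escape character; used to match reserved characters literally: [ ] ( ) { } . * + ? ^ $ \ |
|	Alternation; matches the expression before or after the symbol.
(xyz)	Group; matches the characters xyz in exactly that order.
{n,m}	Matches the preceding character num times, where n <= num <= m.
?	Marks the preceding character as optional.
+	Matches one or more repetitions of the preceding character.
*	Matches zero or more repetitions of the preceding character.
[^ ]	Negated character class; matches any character not listed inside the brackets.
[ ]	Character class; matches any one of the characters inside the brackets.
.	Matches any single character except a newline.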
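When a piece of text that contains metacharacters has to be matched literally, escaping each character by hand is error-prone. One option in Python is re.escape, which adds the backslashes for us. A minimal sketch, assuming a made-up file name that happens to contain parentheses and a dot:

```python
import re

# Hypothetical file name containing the metacharacters ( ) and .
filename = 'wrfout_d01(test).nc'

# Used as a pattern directly, ( ) form a group and . matches any character,
# so the pattern does not match the literal file name.
unescaped_hit = re.search(filename, filename)

# re.escape puts a backslash in front of every metacharacter.
escaped_pattern = re.escape(filename)
escaped_hit = re.fullmatch(escaped_pattern, filename)

print(unescaped_hit is None)
print(escaped_hit is not None)
```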

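2.1 The dot operator .

. is the simplest metacharacter. It matches any single character except a newline. For example, in the expression wrf_d03_20180.\.nc the first . matches any one character that is preceded by wrf_d03_20180 and followed by .nc.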

import re
text = 'wrf_d03_201805.nc wrf_d03_201806.nc wrf_d03_201807.nc wrf_d03_201810.nc wrf_d03_201806 wrf_d03_201812'
regex = r'wrf_d03_20180.\.nc'  # the first dot is the dot operator; the backslash escapes the second dot, so it matches a literal '.'
print(re.findall(regex,text))

['wrf_d03_201805.nc', 'wrf_d03_201806.nc', 'wrf_d03_201807.nc']
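To see why the second dot needs a backslash, the following sketch compares an escaped and an unescaped dot. The second file name, with an underscore in place of the dot, is invented for the comparison.

```python
import re

# The second name is a made-up near miss: '_nc' instead of '.nc'.
text = 'wrf_d03_201805.nc wrf_d03_201806_nc'

# Unescaped: the last dot matches any character, including '_'.
loose = re.findall(r'wrf_d03_20180..nc', text)

# Escaped: the last dot must be a literal '.'.
strict = re.findall(r'wrf_d03_20180.\.nc', text)

print(loose)
print(strict)
```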

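2.2 Character sets

A character set is also called a character class. Square brackets specify a character set, and a hyphen inside the brackets specifies a range of characters. The order of the characters inside the brackets does not matter. For example, the expression [Ww]rf matches Wrf and wrf.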

import re
text = 'Wrf666.nchjhjhjhjhffsfsgfergwrf777.ncfjkajawrf888888.nc'
regex = r'[Ww]rf[0-9]{3,6}\.nc'  # {3,6} matches the preceding [0-9] num times, where 3 <= num <= 6
print(re.findall(regex,text))

['Wrf666.nc', 'wrf777.nc', 'wrf888888.nc']
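The two properties mentioned above, ranges and order independence, can be sketched as follows. The sample string of domain labels is an assumption made up for this illustration.

```python
import re

# Made-up string of domain labels for the demonstration.
text = 'd01 d02 d03 d04 D01'

# A hyphen gives a range: [1-3] is the same as [123].
in_range = re.findall(r'd0[1-3]', text)

# Order inside the brackets does not matter: both patterns match the same text.
first_order = re.findall(r'[dD]0[13]', text)
second_order = re.findall(r'[Dd]0[31]', text)

print(in_range)
print(first_order == second_order)
```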

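2.2.1 Negated character sets

Normally ^ marks the start of a string, but when it is the first character inside square brackets it negates the character set. In the example below, [^We] matches any single character that is neither W nor e.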

import re
text_list = ['Wrfout_d02_2019080215.nc', 'wrfout_d02_2019080615.nc', 'wrfout_d03_2019080715.nc',
             'WRFCHEM_d02_2019081213.nc', 'wrfout_d01_2019080215.nc', 'erfout_d01_2019080215.nc']
# A stricter, fully anchored alternative: r'^wrfout_d0[0-3]_2019[0-9]{6}\.nc$'
regex = '[^We]rfout_d0[0-3]_2019'  # the first character may be anything except W or e
for each_file in text_list:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Match:', re.findall(regex, each_file))
    else:
        print('%s not match anything' % each_file)

Wrfout_d02_2019080215.nc not match anything
wrfout_d02_2019080615.nc
Match: ['wrfout_d02_2019']
wrfout_d03_2019080715.nc
Match: ['wrfout_d03_2019']
WRFCHEM_d02_2019081213.nc not match anything
wrfout_d01_2019080215.nc
Match: ['wrfout_d01_2019']
erfout_d01_2019080215.nc not match anything

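2.3 Repetition

The metacharacters +, * and ? follow a subpattern and specify how many times it may be matched. Each of them behaves differently.

2.3.1 The * quantifier

* matches zero or more occurrences of the preceding character. In the example below, the expression wrf*out means: w, followed by r, followed by zero or more f characters, and then o, u, t in turn.

'wrffffout_d02_2019080215.nc' matches with four f characters. In 'wrout_d02_2019080615.nc' there are zero f characters after the r, so it is matched as well.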

import re
text_list =['wrffffout_d02_2019080215.nc','wrout_d02_2019080615.nc','wrfout_d03_2019080715.nc'
       ,'WRFCHEM_d02_2019081213.nc']

regex = 'wrf*out'
for each_file in text_list:
    if re.search(regex,each_file) is not None:
        print(each_file)
        print('Match:', re.findall(regex, each_file))
    else:
        print('%s not match anything' % each_file)

wrffffout_d02_2019080215.nc
Match: ['wrffffout']
wrout_d02_2019080615.nc
Match: ['wrout']
wrfout_d03_2019080715.nc
Match: ['wrfout']
WRFCHEM_d02_2019081213.nc not match anything

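2.3.2 The + quantifier

+ matches one or more occurrences of the preceding character. For example, the expression c.+t matches a string that starts with c, ends with t, and has at least one character in between. Note the difference from the * in 2.3.1: because 'wrout_d02_2019080615.nc' has no f after wr, it is not matched.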

import re
text_list =['wrffffout_d02_2019080215.nc','wrout_d02_2019080615.nc','wrfout_d03_2019080715.nc'
       ,'WRFCHEM_d02_2019081213.nc']

regex = 'wrf+out'
for each_file in text_list:
    if re.search(regex,each_file) is not None:
        print(each_file)
        print('Match:', re.findall(regex, each_file))
    else:
        print('%s not match anything' % each_file)

wrffffout_d02_2019080215.nc
Match: ['wrffffout']
wrout_d02_2019080615.nc not match anything
wrfout_d03_2019080715.nc
Match: ['wrfout']
WRFCHEM_d02_2019081213.nc not match anything

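2.3.3 The ? quantifier

In a regular expression the metacharacter ? marks the preceding character as optional, meaning it occurs 0 or 1 times. For example, the expression [w]?rf matches the strings rf and wrf.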

import re
text = ['wrfout_d02_2019080215.nc', 'Wrfout_d02_2019080615.nc', 'wrfout_d03_2019080715.nc',
        'WRFCHEM_d02_2019081213.nc', 'rfout_d02_2019080215.nc']

regex = '[wW]?rfout_d0[1-3].+nc'  # the leading w or W is optional
for each_file in text:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Match:', re.findall(regex, each_file))
    else:
        print('%s not match anything' % each_file)

wrfout_d02_2019080215.nc
Match: ['wrfout_d02_2019080215.nc']
Wrfout_d02_2019080615.nc
Match: ['Wrfout_d02_2019080615.nc']
wrfout_d03_2019080715.nc
Match: ['wrfout_d03_2019080715.nc']
WRFCHEM_d02_2019081213.nc not match anything
rfout_d02_2019080215.nc
Match: ['rfout_d02_2019080215.nc']
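To put the three quantifiers side by side, the sketch below applies wrf*out, wrf+out and wrf?out to the same three short names. The names and the results dictionary are made up for this comparison, and re.fullmatch is used so that the whole name has to match.

```python
import re

# Made-up names with zero, one and two 'f' characters.
names = ['wrout', 'wrfout', 'wrffout']

results = {}
for quantifier in ['*', '+', '?']:
    pattern = 'wrf' + quantifier + 'out'
    # Keep only the names that the pattern matches from start to end.
    results[quantifier] = [n for n in names if re.fullmatch(pattern, n)]

for quantifier, matched in results.items():
    print(quantifier, matched)
```

In short, * accepts any number of f characters, + requires at least one, and ? accepts at most one.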

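2.4 Braces {}

In a regular expression {} is a quantifier, commonly used to limit how many times a character or group of characters may repeat. For example, the expression [0-9]{4,10} matches a run of at least 4 and at most 10 digits from 0 to 9, while [0-9]{2} matches exactly two digits.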

import re
text = ['The number was 9.9997 but we rounded it off to 10.0.', 'Wrfout_d02_2019080615.nc', 'wrfout_d03_2019080715.nc',
        'WRFCHEM_d02_2019081213.nc', 'rfout_d02_2019080215.nc', 'WRFCHEM_d02_2019.nc']

regex = r'd[0-9]{2}_+[0-9]{4,10}\.nc'  # two digits after d, then 4 to 10 digits before .nc
for each_file in text:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Match:', re.findall(regex, each_file))
    else:
        print('%s not match anything' % each_file)

The number was 9.9997 but we rounded it off to 10.0. not match anything
Wrfout_d02_2019080615.nc
Match: ['d02_2019080615.nc']
wrfout_d03_2019080715.nc
Match: ['d03_2019080715.nc']
WRFCHEM_d02_2019081213.nc
Match: ['d02_2019081213.nc']
rfout_d02_2019080215.nc
Match: ['d02_2019080215.nc']
WRFCHEM_d02_2019.nc
Match: ['d02_2019.nc']
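Besides {n,m}, Python's re module also accepts {n} (exactly n times) and {n,} (at least n times). A minimal sketch on three made-up digit strings, using re.fullmatch so that the whole string must satisfy the count:

```python
import re

# Made-up digit strings of length 4, 6 and 10.
tokens = ['2019', '201908', '2019080615']

# {4}: exactly four digits.
exactly_four = [t for t in tokens if re.fullmatch(r'[0-9]{4}', t)]

# {6,}: six digits or more.
six_or_more = [t for t in tokens if re.fullmatch(r'[0-9]{6,}', t)]

# {4,6}: between four and six digits.
four_to_six = [t for t in tokens if re.fullmatch(r'[0-9]{4,6}', t)]

print(exactly_four)
print(six_or_more)
print(four_to_six)
```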

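2.5 Capturing groups (...)

A capturing group is a subpattern written inside (...). As noted earlier, {} specifies how many times the preceding character repeats, but when a group comes before {} the whole group is repeated. For example, the expression (ab)* matches zero or more consecutive occurrences of ab. We can also use the alternation character | inside () to mean "or".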

import re
text = ['erfout_d02_2019080615.nc',
        'Wrfout_d02_2019080615.nc',
        'wrfout_d03_2019080715.nc',
        'WRFCHEM_d02_2019081213.nc',
        'rfout_d02_2019080215.nc']

regex = '(W|w|e)rf'
for each_file in text:
    if re.search(regex, each_file) is not None:
        print(each_file)
        # With one group in the pattern, re.findall returns the group's text, not the whole match
        print('Match:', re.findall(regex, each_file))
    else:
        print('%s not match anything' % each_file)

erfout_d02_2019080615.nc
Match: ['e']
Wrfout_d02_2019080615.nc
Match: ['W']
wrfout_d03_2019080715.nc
Match: ['w']
WRFCHEM_d02_2019081213.nc not match anything
rfout_d02_2019080215.nc not match anything
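The output above shows only the single letter because re.findall returns the text of the group when the pattern contains one. If the whole match is wanted, two options in Python are a non-capturing group (?:...) or the match object's group(0). A minimal sketch on one of the file names above:

```python
import re

name = 'wrfout_d03_2019080715.nc'

# Capturing group: findall returns only what the group captured.
captured = re.findall(r'(W|w|e)rf', name)

# Non-capturing group (?:...): findall returns the whole match.
whole = re.findall(r'(?:W|w|e)rf', name)

# group(0) of a match object is always the whole match.
via_search = re.search(r'(W|w|e)rf', name).group(0)

print(captured)
print(whole)
print(via_search)
```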

2.6 The alternation operator |

The alternation operator means "or": the pattern matches if either alternative matches.

import re
text = ['The car is parked in the garage.',
        'Wrfout_d02_2019080615.nc',
        'wrfout_d03_2019080715.nc',
        'WRFCHEM_d02_2019081213.nc',
        'rfout_d02_2019080215.nc']

regex = 'WRFCHEM|wrf'  # match either WRFCHEM or wrf
for each_file in text:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Match:', re.findall(regex, each_file))
    else:
        print('%s not match anything' % each_file)

The car is parked in the garage. not match anything
Wrfout_d02_2019080615.nc not match anything
wrfout_d03_2019080715.nc
Match: ['wrf']
WRFCHEM_d02_2019081213.nc
Match: ['WRFCHEM']
rfout_d02_2019080215.nc not match anything

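2.7 Anchors

To match text at the start or the end of a string, we use anchors: ^ pins the match to the start and $ pins it to the end.

^ checks whether the match sits at the start of the string being searched.

For example, applying the expression ^a to abc gives the result a, but ^b matches nothing, because the string abc does not start with b.

Like ^, the $ sign checks whether the match sits at the very end of the string.

For example, (at\.)$ matches a string whose last three characters are a, t and a literal dot.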
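The anchor examples can be tried out as follows. The sentence and the three file names are assumptions made up for this sketch; the fully anchored pattern is one way to accept only names that are exactly of the wrfout form.

```python
import re

# ^ anchors the match at the start of the string.
starts_with_a = re.findall(r'^a', 'abc')
starts_with_b = re.findall(r'^b', 'abc')

# $ anchors the match at the end; \. is a literal dot.
ends_with_at = re.findall(r'(at\.)$', 'The fat cat sat on the mat.')

# Made-up file names: only the first is exactly a wrfout file.
files = ['wrfout_d01_2019080215.nc',
         'backup_wrfout_d01_2019080215.nc',
         'wrfout_d01_2019080215.nc.tmp']
pattern = r'^wrfout_d0[0-3]_2019[0-9]{6}\.nc$'
kept = [f for f in files if re.search(pattern, f)]

print(starts_with_a, starts_with_b)
print(ends_with_at)
print(kept)
```

Without the anchors, the same pattern would also find a match inside the other two names.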

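2.8 Shorthand character classes

Regular expressions provide shorthand for commonly used character sets, as follows:

.     Any character except a newline
\w     Matches letters, digits and the underscore; equivalent to [a-zA-Z0-9_] for ASCII text
\W     Matches any non-word character, i.e. symbols; equivalent to [^\w]
\d     Matches a digit: [0-9]
\D     Matches a non-digit: [^\d]
\s     Matches any whitespace character, including [ \t\n\r\f\v]
\S     Matches any non-whitespace character: [^\s]
\f     Matches a form feed
\n     Matches a newline
\r     Matches a carriage return
\t     Matches a tab
\v     Matches a vertical tab
\p     Matches CR/LF (equivalent to \r\n), used for DOS line terminators; Python's re module does not support \p, so write \r\n there instead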
Looking back at the '\d{4}-\d{2}-\d{2}' from the opening example, doesn't it seem easy now? The \d here is simply shorthand for [0-9].
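As a small illustration of the shorthand classes working together, the sketch below splits one log line into date, time and message. It assumes a line in the same layout as the opening log, but with ASCII colons in the time; the variable names are illustrative.

```python
import re

# Sample line modelled on the opening log (ASCII colons assumed).
line = '2018-01-26 19:23:55    predict  the wind speed'

# \d matches digits, \s+ matches the run of spaces between the fields.
m = re.search(r'(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+(.+)', line)
date, time, message = m.groups()

# Splitting on \s+ treats any run of whitespace as one separator.
words = re.split(r'\s+', message)

print(date, time)
print(words)
```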

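With regular expressions we can also pick out the wrfout data for 15:00 each day (wrf.+15\.nc), or the data for August of a given year. The approach carries over directly, so we won't list every case.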

import re
text_list =['wrffffout_d02_2019080215.nc','wrout_d02_2019080615.nc','wrfout_d03_2019080715.nc'
       ,'WRFCHEM_d02_2019081213.nc']

regex = 'wrf.+15\.nc'
for each_file in text_list:
    if re.search(regex,each_file) is not None:
        print(each_file)
        print('Match:', re.findall(regex, each_file))
    else:
        print('%s not match anything' % each_file)

wrffffout_d02_2019080215.nc
Match: ['wrffffout_d02_2019080215.nc']
wrout_d02_2019080615.nc not match anything
wrfout_d03_2019080715.nc
Match: ['wrfout_d03_2019080715.nc']
WRFCHEM_d02_2019081213.nc not match anything
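For the August case, one possible pattern is sketched below. It assumes the file names follow the wrfout_dNN_yyyymmddhh.nc layout seen in the examples above; the list of files and the choice of 2019 are made up for the demonstration.

```python
import re

# Made-up file list; the timestamp is assumed to be yyyymmddhh.
files = ['wrfout_d02_2019080215.nc',
         'wrfout_d03_2019071215.nc',
         'wrfout_d01_2018081506.nc',
         'wrfout_d01_2019081506.nc']

# 201908 fixes the year and month; \d{4} covers the day and hour.
pattern = r'^wrfout_d0[1-3]_201908\d{4}\.nc$'
august_2019 = [f for f in files if re.search(pattern, f)]

print(august_2019)
```

Changing 201908 to \d{4}08 would select August of any year instead.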

References

[1] https://github.com/ziishaned/learn-regex/blob/master/translations/README-cn.md
[2] https://github.com/deepwindlee/MySQL-with-Python-DATA-MINING/blob/master/0814%E5%AD%A6%E4%B9%A0%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F.ipynb

Originally published: 2019-08-14

This article was shared from the MeteoAI WeChat official account.
