# Regular Expressions

```
with open('xxx.log.txt', encoding='utf-8') as f:  # read the whole log into one string
    log = f.read()
print(log)
```
```
'2018-01-16 09：14：35   reading  EC DATA\n
2018-01-16 10：17：37    reprocess the EC DATA\n
2018-01-17 18：18：38    put into WRF,\n
2018-01-22  16：17：37   extract the grid data to nearest station, merge with actual data, save to Mysql database \n
2018-01-24 17：14：39    extract the data from Mysql and put the station data to CNN-LSTM model\n
2018-01-24 22：12：39    training the  CNN-LSTM model\n
2018-01-25 17：09：22    tuning\n
2018-01-25 17：09：22    save the best model\n
2018-01-26 19：23：55    predict  the wind speed\n
2018-01-27 06：09：45    save and evaluate\n\n\n'
```

The dates in the log use the yyyy-mm-dd format. If our dear client suddenly demands that every date be changed to mm/dd/yyyy, what do we do?

```
import re
pattern = r'\d{4}-\d{2}-\d{2}'  # four digits, a hyphen, two digits, a hyphen, two digits
print(re.findall(pattern, log))
```
```
['2018-01-16',
'2018-01-16',
'2018-01-17',
'2018-01-22',
'2018-01-24',
'2018-01-24',
'2018-01-25',
'2018-01-25',
'2018-01-26',
'2018-01-27']
```

```
pattern_ed = r'(\d{4})-(\d{2})-(\d{2})'  # group 1 = year, group 2 = month, group 3 = day
sub_order = r'\2/\3/\1'  # reorder the groups as month/day/year
print(re.sub(pattern_ed, sub_order, log))
```
```
'01/16/2018 09：14：35   reading  EC DATA\n
01/16/2018 10：17：37    reprocess the EC DATA\n
01/17/2018 18：18：38    put into WRF,\n
01/22/2018  16：17：37   extract the grid data to nearest station, merge with actual data, save to Mysql database \n
01/24/2018 17：14：39    extract the data from Mysql and put the station data to CNN-LSTM model\n
01/24/2018 22：12：39    training the  CNN-LSTM model\n
01/25/2018 17：09：22    tuning\n
01/25/2018 17：09：22    save the best model\n
01/26/2018 19：23：55    predict  the wind speed\n
01/27/2018 06：09：45    save and evaluate\n\n\n'
```

```
# The same substitution with named groups; each name matches the field it captures
pattern_ed = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
sub_order = r'\g<month>/\g<day>/\g<year>'
print(re.sub(pattern_ed, sub_order, log))
```
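Before running a substitution with named groups, it can help to inspect what each group captures on a single line. The following is a minimal sketch under stated assumptions: `sample_line` is a made-up line in the same layout as the log above, and `date_pattern`, `date_parts` and `converted` are illustrative names that do not appear in the original code.

```python
import re

# Hypothetical sample line in the same layout as the log above
sample_line = '2018-01-16 09:14:35   reading  EC DATA'

# Name each group after the field it actually captures
date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# groupdict() maps each group name to the text it matched
match = date_pattern.search(sample_line)
date_parts = match.groupdict() if match else {}
print(date_parts)

# Reorder the named groups into mm/dd/yyyy
converted = date_pattern.sub(r'\g<month>/\g<day>/\g<year>', sample_line)
print(converted)
```

If a group name and the field it captures disagree, the mismatch shows up immediately in the printed dictionary.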

### 1. Basic matching

```
import re
text = 'WRF_d03_hunan_20190608_16:00:00'
regex_1 = 'd03'
regex_2 = 'D03'  # matching is case-sensitive, so this pattern finds nothing
print('Matched:', re.findall(regex_1, text))
print('Matched:', re.findall(regex_2, text))
```
```
Matched: ['d03']
Matched: []
```
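Plain-text patterns are case-sensitive by default. If case should be ignored, one option is the `re.IGNORECASE` flag; the sketch below reuses the same `text` string, and the variable names `case_sensitive` and `case_insensitive` are only illustrative.

```python
import re

text = 'WRF_d03_hunan_20190608_16:00:00'

# Default behaviour: 'D03' does not match the lowercase 'd03'
case_sensitive = re.findall('D03', text)

# With re.IGNORECASE the same pattern matches regardless of case
case_insensitive = re.findall('D03', text, flags=re.IGNORECASE)

print(case_sensitive)    # []
print(case_insensitive)  # ['d03']
```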

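### 2. Metacharacters

The metacharacters covered in this section:

| Metacharacter | Description |
| --- | --- |
| `.` | Matches any single character except a newline. |
| `[ ]` | Character set: matches any one character listed between the brackets. |
| `[^ ]` | Negated character set: matches any one character not listed between the brackets. |
| `*` | Matches 0 or more repetitions of the preceding symbol. |
| `+` | Matches 1 or more repetitions of the preceding symbol. |
| `?` | Makes the preceding symbol optional (0 or 1 occurrence). |
| `{n,m}` | Matches at least n and at most m repetitions of the preceding symbol. |
| `(xyz)` | Character group: matches the characters xyz in exactly that order. |
| `\|` | Alternation: matches the expression before or after the operator. |
| `\` | Escapes the next character so that a reserved character is matched literally. |
| `^` | Matches the beginning of the input. |
| `$` | Matches the end of the input. |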

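### 2.1 The dot operator .

The dot . is the simplest metacharacter. It matches any single character except a newline. For example, in the expression wrf_d03_20180.\.nc the first . matches any one character that is preceded by wrf_d03_20180 and followed by .nc.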

```
import re
text = 'wrf_d03_201805.nc wrf_d03_201806.nc wrf_d03_201807.nc wrf_d03_201810.nc wrf_d03_201806 wrf_d03_201812'
# The first dot is the dot operator; the second is escaped with \ so it matches a literal '.'
regex = r'wrf_d03_20180.\.nc'
print(re.findall(regex, text))
```
`['wrf_d03_201805.nc', 'wrf_d03_201806.nc', 'wrf_d03_201807.nc']`
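The claim that the dot does not match a newline can be checked directly. This is a small sketch rather than part of the original example: the string `text` is made up for the purpose, and `re.DOTALL` is the standard flag that lifts the restriction.

```python
import re

# A made-up name that happens to be split across two lines
text = 'wrf_d03\n201805.nc'

# By default the dot refuses to match '\n', so nothing is found
default_match = re.findall(r'd03.2018', text)

# With re.DOTALL the dot matches the newline as well
dotall_match = re.findall(r'd03.2018', text, flags=re.DOTALL)

print(default_match)  # []
print(dotall_match)   # ['d03\n2018']
```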

### 2.2 Character sets

```
import re
text = 'Wrf666.nchjhjhjhjhffsfsgfergwrf777.ncfjkajawrf888888.nc'
# regex = r'[Ww]rf[0-9]{1,9}\.nc$'  # anchored variant: only matches at the end of the string
# [Ww] matches W or w; [0-9]{3,6} repeats the preceding set between n=3 and m=6 times
regex = r'[Ww]rf[0-9]{3,6}\.nc'
print(re.findall(regex, text))
```
`['Wrf666.nc', 'wrf777.nc', 'wrf888888.nc']`
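Two properties of character sets are easy to miss: a hyphen inside the brackets defines a range, and a dot inside the brackets is a literal dot. The sketch below illustrates both on a made-up string; `text`, `domains` and `literal_dot` are illustrative names.

```python
import re

text = 'wrf_d01.nc wrf_d02xnc wrf_d04.nc'

# [1-3] is a range: it matches exactly one of the digits 1, 2 or 3
domains = re.findall(r'd0[1-3]', text)

# Inside brackets the dot loses its special meaning and matches only '.'
literal_dot = re.findall(r'd0[0-9][.]nc', text)

print(domains)      # ['d01', 'd02']
print(literal_dot)  # ['d01.nc', 'd04.nc']
```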

### 2.2.1 Negated character sets

```
import re
text_list = ['Wrfout_d02_2019080215.nc', 'wrfout_d02_2019080615.nc', 'wrfout_d03_2019080715.nc',
             'WRFCHEM_d02_2019081213.nc', 'wrfout_d01_2019080215.nc', 'erfout_d01_2019080215.nc']
# regex = r'^wrfout_d0[0-3]_2019[0-9]{6}\.nc$'
regex = r'[^We]rfout_d0[0-3]_2019'  # [^We] matches any single character except W or e
for each_file in text_list:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Matches:', re.findall(regex, each_file))
    else:
        print('%s does not match anything' % each_file)
```
```
Wrfout_d02_2019080215.nc does not match anything
wrfout_d02_2019080615.nc
Matches: ['wrfout_d02_2019']
wrfout_d03_2019080715.nc
Matches: ['wrfout_d03_2019']
WRFCHEM_d02_2019081213.nc does not match anything
wrfout_d01_2019080215.nc
Matches: ['wrfout_d01_2019']
erfout_d01_2019080215.nc does not match anything
```

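### 2.3.1 The * quantifier

The * matches zero or more repetitions of the preceding character. In the example below, the expression wrf*out means a w followed by an r, the r followed by zero or more f's, and then o, u, t in order.

'wrffffout_d02_2019080215.nc' matches with four f's. In 'wrout_d02_2019080615.nc' the r is followed by zero f's, so it is matched as well.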

```
import re
text_list = ['wrffffout_d02_2019080215.nc', 'wrout_d02_2019080615.nc', 'wrfout_d03_2019080715.nc',
             'WRFCHEM_d02_2019081213.nc']

regex = 'wrf*out'  # zero or more f's between 'wr' and 'out'
for each_file in text_list:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Matches:', re.findall(regex, each_file))
    else:
        print('%s does not match anything' % each_file)
```
```
wrffffout_d02_2019080215.nc
Matches: ['wrffffout']
wrout_d02_2019080615.nc
Matches: ['wrout']
wrfout_d03_2019080715.nc
Matches: ['wrfout']
WRFCHEM_d02_2019081213.nc does not match anything
```
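The * is not limited to single characters; it can also repeat a whole character set. A minimal sketch on a made-up string follows (`text` and `names` are illustrative names):

```python
import re

text = 'wrfout_d03 wrf_d01 wrfchem_d02'

# [a-z]* matches zero or more lowercase letters after 'wrf',
# so the bare 'wrf' in 'wrf_d01' is matched too
names = re.findall(r'wrf[a-z]*', text)

print(names)  # ['wrfout', 'wrf', 'wrfchem']
```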

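### 2.3.2 The + quantifier

The + matches one or more repetitions of the preceding character. For example, the expression c.+t matches a string that starts with c and ends with t, with at least one character in between. Note the difference from the * in 2.3.1: because 'wrout_d02_2019080615.nc' has no f after wr, it is not matched.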

```
import re
text_list = ['wrffffout_d02_2019080215.nc', 'wrout_d02_2019080615.nc', 'wrfout_d03_2019080715.nc',
             'WRFCHEM_d02_2019081213.nc']

regex = 'wrf+out'  # at least one f between 'wr' and 'out'
for each_file in text_list:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Matches:', re.findall(regex, each_file))
    else:
        print('%s does not match anything' % each_file)
```
```
wrffffout_d02_2019080215.nc
Matches: ['wrffffout']
wrout_d02_2019080615.nc does not match anything
wrfout_d03_2019080715.nc
Matches: ['wrfout']
WRFCHEM_d02_2019081213.nc does not match anything
```
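The expression c.+t mentioned in this section can be tried on a short sentence. This is a sketch, not part of the original notebook: the sentence and the variable names are assumptions chosen for illustration.

```python
import re

sentence = 'The fat cat sat on the mat.'

# .+ is greedy: it extends to the last possible 't', not the first one
greedy = re.findall(r'c.+t', sentence)

# c.+t needs at least one character between c and t; c.*t does not
plus_on_ct = re.findall(r'c.+t', 'ct')
star_on_ct = re.findall(r'c.*t', 'ct')

print(greedy)      # ['cat sat on the mat']
print(plus_on_ct)  # []
print(star_on_ct)  # ['ct']
```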

### 2.3.3 The ? quantifier

```
import re
text = ['wrfout_d02_2019080215.nc', 'Wrfout_d02_2019080615.nc', 'wrfout_d03_2019080715.nc',
        'WRFCHEM_d02_2019081213.nc', 'rfout_d02_2019080215.nc']

regex = '[wW]?rfout_d0[1-3].+nc'  # [wW]? makes the leading w/W optional (0 or 1 occurrence)
for each_file in text:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Matches:', re.findall(regex, each_file))
    else:
        print('%s does not match anything' % each_file)
```
```
wrfout_d02_2019080215.nc
Matches: ['wrfout_d02_2019080215.nc']
Wrfout_d02_2019080615.nc
Matches: ['Wrfout_d02_2019080615.nc']
wrfout_d03_2019080715.nc
Matches: ['wrfout_d03_2019080715.nc']
WRFCHEM_d02_2019081213.nc does not match anything
rfout_d02_2019080215.nc
Matches: ['rfout_d02_2019080215.nc']
```
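The ? makes the preceding character optional, meaning it may appear zero times or once but not more. The following sketch shows this on two made-up strings; the `.nc4` suffix and all variable names are assumptions used only for illustration.

```python
import re

# '4?' accepts both the '.nc' and the '.nc4' suffix
optional_suffix = re.findall(r'wrfout_d0[1-3]\.nc4?', 'wrfout_d01.nc wrfout_d02.nc4')

# '_?' accepts zero or one underscore, but not two
optional_underscore = re.findall(r'wrf_?out', 'wrfout wrf_out wrf__out')

print(optional_suffix)      # ['wrfout_d01.nc', 'wrfout_d02.nc4']
print(optional_underscore)  # ['wrfout', 'wrf_out']
```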

### 2.4 The {} quantifier

```
import re
text = ['The number was 9.9997 but we rounded it off to 10.0.', 'Wrfout_d02_2019080615.nc', 'wrfout_d03_2019080715.nc',
        'WRFCHEM_d02_2019081213.nc', 'rfout_d02_2019080215.nc', 'WRFCHEM_d02_2019.nc']

# d, exactly 2 digits, one or more underscores, 4 to 10 digits, then a literal '.nc'
regex = r'd[0-9]{2}_+[0-9]{4,10}\.nc'
for each_file in text:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Matches:', re.findall(regex, each_file))
    else:
        print('%s does not match anything' % each_file)
```
```
The number was 9.9997 but we rounded it off to 10.0. does not match anything
Wrfout_d02_2019080615.nc
Matches: ['d02_2019080615.nc']
wrfout_d03_2019080715.nc
Matches: ['d03_2019080715.nc']
WRFCHEM_d02_2019081213.nc
Matches: ['d02_2019081213.nc']
rfout_d02_2019080215.nc
Matches: ['d02_2019080215.nc']
WRFCHEM_d02_2019.nc
Matches: ['d02_2019.nc']
```
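Braces come in three forms: {n,m} for a range, {n,} for a lower bound only, and {n} for an exact count. The sketch below applies each form to the sentence from the example above; the variable names are illustrative.

```python
import re

sentence = 'The number was 9.9997 but we rounded it off to 10.0.'

two_to_three = re.findall(r'[0-9]{2,3}', sentence)  # between 2 and 3 digits
two_or_more = re.findall(r'[0-9]{2,}', sentence)    # at least 2 digits
exactly_three = re.findall(r'[0-9]{3}', sentence)   # exactly 3 digits

print(two_to_three)   # ['999', '10']
print(two_or_more)    # ['9997', '10']
print(exactly_three)  # ['999']
```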

### 2.5 (...) Character groups

```
import re
text = ['erfout_d02_2019080615.nc',
        'Wrfout_d02_2019080615.nc',
        'wrfout_d03_2019080715.nc',
        'WRFCHEM_d02_2019081213.nc',
        'rfout_d02_2019080215.nc']

regex = '(W|w|e)rf'  # W, w or e followed by rf; findall returns only the text captured by the group
for each_file in text:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Matches:', re.findall(regex, each_file))
    else:
        print('%s does not match anything' % each_file)
```
```
erfout_d02_2019080615.nc
Matches: ['e']
Wrfout_d02_2019080615.nc
Matches: ['W']
wrfout_d03_2019080715.nc
Matches: ['w']
WRFCHEM_d02_2019081213.nc does not match anything
rfout_d02_2019080215.nc does not match anything
```
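When a pattern contains a capturing group, `re.findall` returns the group's content rather than the whole match. If the whole match is wanted, one option is a non-capturing group `(?:...)`. A minimal sketch, with illustrative variable names:

```python
import re

file_name = 'erfout_d02_2019080615.nc'

# With a capturing group, findall returns only what the group captured
captured = re.findall(r'(W|w|e)rf', file_name)

# (?:...) groups without capturing, so findall returns the whole match
whole = re.findall(r'(?:W|w|e)rf', file_name)

print(captured)  # ['e']
print(whole)     # ['erf']
```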

### 2.6 The | alternation operator

```
import re
text = ['The car is parked in the garage.',
        'Wrfout_d02_2019080615.nc',
        'wrfout_d03_2019080715.nc',
        'WRFCHEM_d02_2019081213.nc',
        'rfout_d02_2019080215.nc']

regex = 'WRFCHEM|wrf'  # matches either 'WRFCHEM' or 'wrf'
for each_file in text:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Matches:', re.findall(regex, each_file))
    else:
        print('%s does not match anything' % each_file)
```
```
The car is parked in the garage. does not match anything
Wrfout_d02_2019080615.nc does not match anything
wrfout_d03_2019080715.nc
Matches: ['wrf']
WRFCHEM_d02_2019081213.nc
Matches: ['WRFCHEM']
rfout_d02_2019080215.nc does not match anything
```

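### 2.7 Anchors

^ checks whether the match starts at the beginning of the input string. Likewise, $ checks whether the match ends at the end of the input string.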
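Anchors matter most when a whole file name has to follow a pattern rather than merely contain it. The sketch below compares an anchored and an unanchored version of the same expression; the list `file_names` is made up for illustration, and the other names are illustrative as well.

```python
import re

# Hypothetical file names: one clean, one with a prefix, one with a suffix
file_names = ['wrfout_d02_2019080615.nc',
              'old_wrfout_d02_2019080615.nc',
              'wrfout_d02_2019080615.nc.bak']

# ^ pins the match to the start of the string, $ to the end
anchored = re.compile(r'^wrfout_d0[0-3]_2019[0-9]{6}\.nc$')
unanchored = re.compile(r'wrfout_d0[0-3]_2019[0-9]{6}\.nc')

anchored_hits = [name for name in file_names if anchored.search(name)]
unanchored_hits = [name for name in file_names if unanchored.search(name)]

print(anchored_hits)    # only the clean name
print(unanchored_hits)  # all three names
```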

### 2.8 Shorthand character sets

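```
.      any character except a newline
\w     any alphanumeric character or underscore; for ASCII text, equivalent to [a-zA-Z0-9_]
\W     any character not matched by \w, i.e. symbols: [^\w]
\d     a digit; for ASCII text, equivalent to [0-9]
\D     a non-digit: [^\d]
\s     any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
\S     any non-whitespace character: [^\s]
\f     a form feed
\n     a newline
\r     a carriage return
\t     a tab
\v     a vertical tab
\p     listed in some regex references as matching CR/LF (\r\n); Python's re module rejects it, so write \r\n instead
```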

```
import re
text_list = ['wrffffout_d02_2019080215.nc', 'wrout_d02_2019080615.nc', 'wrfout_d03_2019080715.nc',
             'WRFCHEM_d02_2019081213.nc']

regex = r'wrf.+15\.nc'  # 'wrf', then one or more of any character, then a literal '15.nc'
for each_file in text_list:
    if re.search(regex, each_file) is not None:
        print(each_file)
        print('Matches:', re.findall(regex, each_file))
    else:
        print('%s does not match anything' % each_file)
```
```
wrffffout_d02_2019080215.nc
Matches: ['wrffffout_d02_2019080215.nc']
wrout_d02_2019080615.nc does not match anything
wrfout_d03_2019080715.nc
Matches: ['wrfout_d03_2019080715.nc']
WRFCHEM_d02_2019081213.nc does not match anything
```
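The shorthand sets themselves can be tried on a log-style line. This is a sketch under stated assumptions: `line` is a made-up string with a tab between the timestamp and the message, and the variable names are illustrative.

```python
import re

# Hypothetical log line; '\t' is a real tab character
line = '2018-01-16 09:14:35\treading EC_DATA'

digits = re.findall(r'\d+', line)  # runs of digits
words = re.findall(r'\w+', line)   # runs of letters, digits and underscores
fields = re.split(r'\s+', line)    # split on any run of whitespace

print(digits)  # ['2018', '01', '16', '09', '14', '35']
print(words)   # ['2018', '01', '16', '09', '14', '35', 'reading', 'EC_DATA']
print(fields)  # ['2018-01-16', '09:14:35', 'reading', 'EC_DATA']
```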



MeteoAI