I have a pandas DataFrame that looks like this:
Name   Address
Alan   23 Belby road, home near me 71234
Tom    PA23 6NH brickby avenue
Solty  7 solty road 7123-234
Ben    Nowhere road 713456 Belgium
I want to extract the postcode, so that the resulting DataFrame looks like this:
Name   Address                            Postcode
Alan   23 Belby road, home near me 71234  71234
Tom    PA23 6NH brickby avenue            PA23 6NH
Solty  7 solty road 7123-234              7123-234
Ben    Nowhere road 713456 Belgium        713456
I have looked at the posts "Python, Regular Expression Postcode search" and "python - get zipcode from full address",
but it is not clear to me how to proceed.
Posted on 2021-04-26 05:31:09
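You can write each postcode format as its own capture group in a single regular expression, with the groups separated by |.
Use extractall to extract every pattern match; this gives one column per capture group (see Multiple Pattern using Regex in Pandas).
Then use bfill to force all matches into the first column (see How to collapse multiple columns into one in pandas).
Finally, merge the result back into the original dataset.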
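To make these steps concrete before the full code, here is a minimal sketch on a toy Series. The pattern `simple_re` and the sample addresses are assumptions made up for illustration; they are much cruder than the full regex used in this answer.

```python
import pandas as pd

# Simplified two-group pattern, for illustration only (not the full regex from this answer).
simple_re = r'([A-Za-z]{1,2}\d{1,2}\s?\d[A-Za-z]{2})|(\d{5,6})'

addresses = pd.Series(['PA23 6NH brickby avenue',
                       'Nowhere road 713456 Belgium',
                       'no code here'], name='Address')

# Step 1: one column per capture group, indexed by (original row, match number).
matches = addresses.str.extractall(simple_re)
print(matches)

# Step 2: back-fill along each row so column 0 holds whichever group matched.
collapsed = matches.bfill(axis=1)[0]

# Step 3: drop the match-number level and attach the result to the original rows.
postcodes = collapsed.droplevel(level=1).rename('Postcode')
result = addresses.to_frame().merge(postcodes, left_index=True,
                                    right_index=True, how='left')
print(result)
```

Rows without any match do not appear in `matches` at all, which is why the final step uses a left merge: those rows survive with NaN in the Postcode column.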
import pandas as pd

# One alternative per postcode format, each in its own capture group:
# the special code "GIR 0AA", UK-style postcodes, and two purely numeric formats.
postcode_re = r'([Gg][Ii][Rr] 0[Aa]{2})|' \
              r'((([A-Za-z][0-9]{1,2})|' \
              r'(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|' \
              r'(([A-Za-z][0-9][A-Za-z])|' \
              r'([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})|' \
              r'(\d{5}\-?\d{0,4})|' \
              r'(\d{4}\-?\d{0,3})'

# Sample data
df = pd.DataFrame({'Name': {0: 'Alan', 1: 'Tom',
                            2: 'Solty', 3: 'Ben',
                            4: 'Mary', 5: 'Mike'},
                   'Address': {0: 'PA23 6NH brickby avenue',
                               1: '818 mention 560100',
                               2: 'calculate AB24 EFT',
                               3: '818 where 560100',
                               4: 'Nowhere road 713456 Belgium',
                               5: '7 solty road 7123-234'}})

# Extract all matches, collapse the capture-group columns into column 0,
# drop the match-number index level, and left-merge onto the original rows.
df = df.merge(df['Address']
              .str
              .extractall(postcode_re)
              .bfill(axis=1)[0]
              .droplevel(level=1)
              .rename('Postcode'),
              left_index=True,
              right_index=True,
              how='left')

print(df.to_string())
Output:
    Name                      Address  Postcode
0   Alan      PA23 6NH brickby avenue  PA23 6NH
1    Tom           818 mention 560100    560100
2  Solty           calculate AB24 EFT       NaN
3    Ben             818 where 560100    560100
4   Mary  Nowhere road 713456 Belgium    713456
5   Mike        7 solty road 7123-234  7123-234
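One caveat with the extractall approach: if a single address contains more than one match, the merge produces one output row per match. As a possible alternative (a different technique from the one above, not part of the original answer), `str.extract` returns only the first match per row, so wrapping the whole alternation in one outer capture group yields a single column directly. The pattern `simple_re` and the sample rows below are simplified assumptions for brevity.

```python
import pandas as pd

# Simplified stand-in pattern (an assumption; the full postcode regex could be used instead).
simple_re = r'[A-Za-z]{1,2}\d{1,2}\s?\d[A-Za-z]{2}|\d{5,6}|\d{4}-\d{3}'

df = pd.DataFrame({'Name': ['Tom', 'Ben', 'Mike', 'Ann'],
                   'Address': ['PA23 6NH brickby avenue',
                               'from 71234 to 560100',   # two candidate codes
                               '7 solty road 7123-234',
                               'no code here']})

# Wrapping the whole alternation in one capture group makes str.extract
# return the first overall match per row as a single Series.
df['Postcode'] = df['Address'].str.extract('(' + simple_re + ')', expand=False)
print(df.to_string())
```

This keeps exactly one row per address and needs neither bfill nor a merge, at the cost of silently ignoring any second match in the same address.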
If you need regular expressions for more postcode formats, see postal-codes.
https://stackoverflow.com/questions/67258240