Need to fix regex to extract the correct information from Excel file names

Stack Overflow user
Asked 2021-03-02 09:13:18
Answers: 1 · Views: 29 · Followers: 0 · Votes: 2

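I have a list of Excel files whose names follow a similar format. I need to use the information in their names as columns of a pandas DataFrame. I'm not very familiar with regex, but I used Google and Stack Overflow to work out what I needed to do. However, there are some edge cases I need help figuring out.

Here is a list of the names of the first 40 files I have, which should help show the challenge I'm facing: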

Maker Month Wise Data  of VADAKARA RTO - KL18 , Kerala (2020).xlsx
Maker Month Wise Data  of KATHUA RTO - JK8 , Jammu & Kashmir (2020).xlsx
Maker Month Wise Data  of KANCHEEPURAM RTO - TN21 , Tamil Nadu (2020).xlsx
Maker Month Wise Data  of KANJIRAPPALLY SRTO - KL34 , Kerala (2020).xlsx
Maker Month Wise Data  of PATHANKOT SDM - PB35 , Punjab (2020).xlsx
Maker Month Wise Data  of Chiplun Chiplun Track - MH202 , Maharashtra (2020).xlsx
Maker Month Wise Data  of ZUNHEBOTO DTO - NL6 , Nagaland (2020).xlsx
Maker Month Wise Data  of MAJITHA SDM - PB81 , Punjab (2020).xlsx
Maker Month Wise Data  of Adinath Fitness Center - RJ260 , Rajasthan (2020).xlsx
Maker Month Wise Data  of MUMBAI (EAST) - MH3 , Maharashtra (2020).xlsx
Maker Month Wise Data  of CHIDAMBARAM RTO - TN544 , Tamil Nadu (2020).xlsx
Maker Month Wise Data  of PUDUCHERRY - PY1 , Puducherry (2020).xlsx
Maker Month Wise Data  of RANIPET RTO - TN73 , Tamil Nadu (2020).xlsx
Maker Month Wise Data  of RTA, HISAR - HR39 , Haryana (2020).xlsx
Maker Month Wise Data  of AIZAWL RURAL DTO - MZ9 , Mizoram (2020).xlsx
Maker Month Wise Data  of ANANDPUR SAHIB SDM - PB16 , Punjab (2020).xlsx
Maker Month Wise Data  of PEN (RAIGAD) - MH6 , Maharashtra (2020).xlsx
Maker Month Wise Data  of PEHOWA - HR41 , Haryana (2020).xlsx
Maker Month Wise Data  of AKOLA - MH30 , Maharashtra (2020).xlsx
Maker Month Wise Data  of CANACONA RTO - GA10 , Goa (2020).xlsx
Maker Month Wise Data  of Hooghly RTO - WB15 , West Bengal (2020).xlsx
Maker Month Wise Data  of DEVIKULAM SRTO - KL68 , Kerala (2020).xlsx
Maker Month Wise Data  of KUTTANADU SRTO - KL66 , Kerala (2020).xlsx
Maker Month Wise Data  of CHENNAI (NORTH-EAST) RTO - TN3 , Tamil Nadu (2020).xlsx
Maker Month Wise Data  of RLA SHILLAI - HP85 , Himachal Pradesh (2020).xlsx
Maker Month Wise Data  of Baloda Bazar DTO - CG22 , Chhattisgarh (2020).xlsx
Maker Month Wise Data  of TC OFFICE - STA OFFICE - KL99 , Kerala (2020).xlsx
Maker Month Wise Data  of NANDURBAR - MH39 , Maharashtra (2020).xlsx
Maker Month Wise Data  of KHETRI DTO - RJ53 , Rajasthan (2020).xlsx
Maker Month Wise Data  of AHMEDGARH SDM - PB82 , Punjab (2020).xlsx
Maker Month Wise Data  of Alipurduar RTO - WB69 , West Bengal (2020).xlsx
Maker Month Wise Data  of RLA GOHAR - HP32 , Himachal Pradesh (2020).xlsx
Maker Month Wise Data  of KOLHAPUR - MH9 , Maharashtra (2020).xlsx
Maker Month Wise Data  of SILVASSA - DD1 , UT of DNH and DD (2020).xlsx
Maker Month Wise Data  of MANNARGHAT SRTO - KL50 , Kerala (2020).xlsx
Maker Month Wise Data  of SRIVILLIPUTHUR RTO - TN605 , Tamil Nadu (2020).xlsx
Maker Month Wise Data  of ZONAL OFFICE, SOUTH WEST DELHI,DWARKA - DL9 , Delhi (2020).xlsx
Maker Month Wise Data  of BUDGAM ARTO - JK4 , Jammu & Kashmir (2020).xlsx
Maker Month Wise Data  of Kolar  RTO - KA7 , Karnataka (2020).xlsx
Maker Month Wise Data  of Singtam, East Sikkim - SK8 , Sikkim (2020).xlsx

This is the code snippet that uses regex to extract the information from these file names:

# Add RTO column - WORKS
rto = re.search('\s\sof\s(.*)\s\-', file_name)
df['RTO'] = rto.group(1)

# Add registration number column - NEEDS TO BE CORRECTED - See match 27
registration_number = re.search('\s\-(.*)\s\,', file_name)
df['Registration Number'] = registration_number.group(1)

# Add state column - NEEDS TO BE CORRECTED - See match 14, 34, 37
state = re.search('\,\s(.*)\s\(', file_name)
df['State'] = state.group(1)

# Add year column - NEEDS TO BE CORRECTED - See match 10, 17, 24, 
year = re.search('\((.*)\)', file_name)
df['Year'] = year.group(1)

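The regex for RTO seems to be correct, but the ones for registration number, state, and year need to be fixed for certain edge cases. I have flagged the incorrect regex lines in the code comments. Please let me know if there is any additional input I can provide.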
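To make the edge cases concrete, here is a minimal sketch that runs the three flagged patterns against three of the listed file names (matches 10, 14, and 27). The variable names are illustrative only. In each case the greedy `(.*)` together with an ambiguous delimiter appears to capture more than intended.

```python
import re

# Three of the listed file names that trip the original greedy patterns.
mumbai = "Maker Month Wise Data  of MUMBAI (EAST) - MH3 , Maharashtra (2020).xlsx"
hisar = "Maker Month Wise Data  of RTA, HISAR - HR39 , Haryana (2020).xlsx"
tc_office = "Maker Month Wise Data  of TC OFFICE - STA OFFICE - KL99 , Kerala (2020).xlsx"

# Year: the match starts at the first "(" and runs to the last ")".
year = re.search(r'\((.*)\)', mumbai).group(1)

# State: the match starts at the first comma, which sits inside the office name.
state = re.search(r'\,\s(.*)\s\(', hisar).group(1)

# Registration number: the match starts at the first " -" rather than the last one.
reg = re.search(r'\s\-(.*)\s\,', tc_office).group(1)

print(repr(year))
print(repr(state))
print(repr(reg))
```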

I would greatly appreciate any help with this!

1 Answer

Stack Overflow user

Accepted answer

Posted on 2021-03-02 09:39:43

I think you can adapt your previous solution like this:

pattern = r'\s+of\s+(.*?)\s+-\s+(.*?)\s+,\s+(.*?)\s+\((\d{4})\)'
df[['RTO', 'Registration Number', 'State','Year']] = df['Maker'].str.extract(pattern, expand=True)
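As a self-contained sketch of the snippet above, the following builds a small frame from four of the listed file names and applies the same pattern. It assumes the file names live in a column called 'Maker', as in the snippet; how the real frame is built is not shown in the question, so the construction here is hypothetical.

```python
import pandas as pd

# Hypothetical frame: one file name per row in a column named 'Maker'.
df = pd.DataFrame({"Maker": [
    "Maker Month Wise Data  of VADAKARA RTO - KL18 , Kerala (2020).xlsx",
    "Maker Month Wise Data  of MUMBAI (EAST) - MH3 , Maharashtra (2020).xlsx",
    "Maker Month Wise Data  of RTA, HISAR - HR39 , Haryana (2020).xlsx",
    "Maker Month Wise Data  of SILVASSA - DD1 , UT of DNH and DD (2020).xlsx",
]})

# Four capture groups: RTO, registration number, state, year.
pattern = r'\s+of\s+(.*?)\s+-\s+(.*?)\s+,\s+(.*?)\s+\((\d{4})\)'
columns = ['RTO', 'Registration Number', 'State', 'Year']

# str.extract returns one column per capture group, in order.
df[columns] = df['Maker'].str.extract(pattern, expand=True)

print(df[columns].to_string(index=False))
```

Note that str.extract returns strings, so 'Year' would need an explicit conversion (for example with astype(int)) if a numeric column is wanted.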

See the regex demo.

If the registration number can only contain uppercase letters and digits, you can replace the second (.*?) with ([A-Z0-9]+) and use \s+of\s+(.*?)\s+-\s+([A-Z0-9]+)\s+,\s+(.*?)\s+\((\d{4})\)
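One file name where the stricter character class seems to matter is match 27, which contains two " - " separators. The sketch below compares both patterns on that name using only the standard re module; the variable names lazy and strict are illustrative.

```python
import re

lazy = r'\s+of\s+(.*?)\s+-\s+(.*?)\s+,\s+(.*?)\s+\((\d{4})\)'
strict = r'\s+of\s+(.*?)\s+-\s+([A-Z0-9]+)\s+,\s+(.*?)\s+\((\d{4})\)'

name = "Maker Month Wise Data  of TC OFFICE - STA OFFICE - KL99 , Kerala (2020).xlsx"

# With (.*?) for the registration number, group 1 stops at the first " - ",
# so the second half of the office name ends up in group 2.
lazy_groups = re.search(lazy, name).groups()

# With [A-Z0-9]+ the engine has to backtrack until group 2 is a single
# token of uppercase letters and digits directly before " , ".
strict_groups = re.search(strict, name).groups()

print(lazy_groups)
print(strict_groups)
```

So if office names can themselves contain " - ", the [A-Z0-9]+ variant looks like the safer choice, provided every registration number really is just uppercase letters and digits.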

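Details

  • \s+ - one or more whitespace characters
  • of - the literal word of
  • \s+ - one or more whitespace characters
  • (.*?) - Group 1: any zero or more characters other than line break characters, as few as possible
  • \s+-\s+ - a hyphen enclosed in one or more whitespace characters
  • (.*?) - Group 2: any zero or more characters other than line break characters, as few as possible (or, if [A-Z0-9]+ is used, one or more uppercase letters or digits)
  • \s+,\s+ - a comma enclosed in one or more whitespace characters
  • (.*?) - Group 3: any zero or more characters other than line break characters, as few as possible
  • \s+ - one or more whitespace characters
  • \( - a ( char
  • (\d{4}) - Group 4: four digits
  • \) - a ) char.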
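The breakdown above can also be written directly into the pattern with re.VERBOSE and named groups. This is only a sketch: the group names and the parse_filename helper are made up for illustration, and it uses the [A-Z0-9]+ variant for the registration number.

```python
import re
from typing import Dict, Optional

FILENAME_RE = re.compile(r"""
    \s+of\s+                      # the word of, surrounded by whitespace
    (?P<rto>.*?)                  # Group 1: office name, as few chars as possible
    \s+-\s+                       # a hyphen enclosed in whitespace
    (?P<registration>[A-Z0-9]+)   # Group 2: uppercase letters and digits
    \s+,\s+                       # a comma enclosed in whitespace
    (?P<state>.*?)                # Group 3: state name, as few chars as possible
    \s+\(                         # whitespace, then an opening parenthesis
    (?P<year>\d{4})               # Group 4: four digits
    \)                            # a closing parenthesis
""", re.VERBOSE)


def parse_filename(name: str) -> Optional[Dict[str, str]]:
    """Return the four fields as a dict, or None if the name does not match."""
    match = FILENAME_RE.search(name)
    return match.groupdict() if match else None


result = parse_filename(
    "Maker Month Wise Data  of CHENNAI (NORTH-EAST) RTO - TN3 , Tamil Nadu (2020).xlsx"
)
print(result)
```

Because the separator is \s+-\s+ rather than a bare hyphen, the hyphen inside NORTH-EAST is not treated as a delimiter.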
Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/66436442
