
Can you please replace your regex with the one below, which looks for the key terms and captures anything between them, and tell me what error, if any, you are now receiving?

<pre><code>a=re.findall(r"Sent:(.*?)&lt;br&gt;", soup.text)[0]
</code></pre>


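Try this. It also handles a <code>&lt;br&gt;</code> tag that contains a slash, whether before the tag name or in the self-closing form <code>&lt;br/&gt;</code>. Python's <code>re</code> takes the pattern as a plain string, so the surrounding <code>/</code> delimiters are not needed.

<pre><code>Sent:(.*?)&lt;/?br\s*/?&gt;
</code></pre>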
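
As a rough illustration of how a slash-tolerant pattern behaves in Python, here is a small sketch. The sample strings are assumptions made up for the demonstration, not taken from the asker's files, and the pattern only allows the slash after the tag name.

```python
import re

# Optional whitespace and slash after "br" cover <br>, <br/> and <br />.
SENT_RE = re.compile(r"Sent:(.*?)<br\s*/?>")

# Hypothetical samples showing three common spellings of the tag.
samples = [
    "Sent: Friday, June 14, 2013 12:07 PM<br>To: ...",
    "Sent: Friday, June 14, 2013 12:07 PM<br/>To: ...",
    "Sent: Friday, June 14, 2013 12:07 PM<br />To: ...",
]

# Each sample is known to match here, so .group(1) is safe in this demo.
captured = [SENT_RE.search(s).group(1).strip() for s in samples]
print(captured)
```

Note that the pattern needs the tags to be present in the string being searched, so it has to run on markup rather than on tag-stripped text.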


You don't need the usual slash escapes:

<pre><code>a = re.findall(r"Sent:(.*?)&lt;br&gt;", soup.text)[0]
</code></pre>

That being said, you should probably check for the output (or at least use try/except) before trying to get a value from it.
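
One way to do that check, sketched with `re.search` in place of `findall(...)[0]`. The helper name `extract_sent` and the sample strings are made up for illustration.

```python
import re

SENT_RE = re.compile(r"Sent:(.*?)<br>")


def extract_sent(text):
    """Return the text captured after 'Sent:', or None when nothing matches."""
    match = SENT_RE.search(text)
    if match is None:
        return None  # no IndexError; the caller decides how to handle it
    return match.group(1).strip()


found = extract_sent("Sent: Friday, June 14, 2013 12:07 PM<br>To: ...")
missing = extract_sent("no header in this text")
print(found, missing)
```

Returning `None` keeps a file without a `Sent:` header from aborting the whole loop; the caller can skip it or log it.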


The problem is not really your regex, but the fact that BeautifulSoup parses the HTML (its job, after all) and changes its content. For example, your <code>&lt;br&gt;</code> will be transformed to <code>&lt;br/&gt;</code>. Another point: <code>soup.text</code> strips all the tags, so a regex that looks for a tag can no longer match.

This script makes it clearer:

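<pre><code>from bs4 import BeautifulSoup
import re
from dateutil import parser

# BeautifulSoup serializes &lt;br&gt; as &lt;br/&gt;, so look ahead for that form
pattern = re.compile(r'Sent:(.+?)(?=&lt;br/&gt;)')

with open("myfile.html", 'r') as f:
    html = f.read()
    print("html: ", html)
    soup = BeautifulSoup(html, 'lxml')
    print("soup.text: ", soup.text)  # tags stripped
    print("str(soup): ", str(soup))  # tags kept, &lt;br&gt; rewritten
    # search the serialized markup, not soup.text
    a = pattern.findall(str(soup))[0]
    print("pattern extraction: ", a)
</code></pre>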

For the second part: since your date string is not formally correct (because of the leading <code>&lt;/b&gt;</code> that the pattern captures), you should add the parameter <code>fuzzy=True</code>, as explained in the <a href="https://dateutil.readthedocs.io/en/stable/parser.html" rel="nofollow noreferrer">documentation of dateutil</a>.

<pre><code>d = parser.parse(a, fuzzy=True)
print("year:", d.year)
print("month:", d.month)
print("day:", d.day)
</code></pre>
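
If the header always has the shape shown in the question (weekday, month name, day, year, 12-hour time), the standard library can parse it without fuzzy matching once the leftover tag is stripped. This is a sketch of that alternative, not what the answer itself uses. The value of `a` is an assumed capture, and `%A` and `%B` expect English names under the default locale.

```python
import re
from datetime import datetime

# Assumed capture from the pattern above, including the stray closing tag.
a = "</b> Friday, June 14, 2013 12:07 PM"

# Drop any leftover tags, then trim surrounding whitespace.
cleaned = re.sub(r"<[^>]+>", "", a).strip()

# %A weekday name, %B month name, %I 12-hour clock, %p AM/PM.
d = datetime.strptime(cleaned, "%A, %B %d, %Y %I:%M %p")
print("year:", d.year)
print("month:", d.month)
print("day:", d.day)
```

A strict format string raises `ValueError` on anything unexpected, which can be preferable to fuzzy parsing when silently wrong dates would be worse than a failure.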

Another solution would be to use a more precise regex that consumes the closing <code>&lt;/b&gt;</code> before the capture group. For example:

<pre><code>pattern = re.compile(r'Sent:&lt;/b&gt;(.+?)(?=&lt;br/&gt;)')
</code></pre>

I am parsing an HTML file and would like to match everything between two sequences of characters: <code>Sent:</code> and the <code>&lt;br&gt;</code> tag.

I have seen several very similar questions and tried all of their methods and none have worked for me, probably because I'm a novice and am doing something very simple incorrectly. 

Here's my relevant code:

<pre><code>for filename in os.listdir(path): #capture email year, month, day
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            html = f.read()
            soup = BeautifulSoup(html, 'html.parser')
            a = re.findall(r'Sent:/.+?(?=&lt;br&gt;)/', soup.text)[0]
            #a = re.findall(r'Sent:(.*)', soup.text)[0]
            print(a)
            d = parser.parse(a)
            print("year:", d.year)
            print("month:", d.month)
            print("day:", d.day)
</code></pre>

I've also tried these for my regex: <code>a = re.findall(r'Sent:/^(.*?)&lt;br&gt;/', soup.text)[0]</code> and <code>a = re.findall(r'Sent:/^[^&lt;br&gt;]*/', soup.text)[0]</code>

But I keep getting the error <code>list index out of range</code>. Even when I remove the <code>[0]</code>, I get the error <code>AttributeError: 'list' object has no attribute 'read'</code> on the line <code>d = parser.parse(a)</code>, with only <code>[]</code> printed as a result of <code>print(a)</code>.

Here's the relevant block of HTML:

<pre><code>&lt;b&gt;Sent:&lt;/b&gt; Friday, June 14, 2013 12:07 PM&lt;br&gt;&lt;b&gt;To:&lt;/b&gt; David Leveille&lt;br&gt;&lt;b&gt;Subject:&lt;/b&gt; 
</code></pre>

Why doesn't my regex work with BeautifulSoup?
