The problem is that your regex pattern is too inclusive: it matches all the URLs at once. You can fix this with a lookahead, <code>(?=...)</code>.

Try this:

<pre><code>re.findall("((www\.|http://|https://)(www\.)*.*?(?=(www\.|http://|https://|$)))", strings)
</code></pre>
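As a quick check, here is a minimal sketch that runs the lookahead pattern <code>((www\.|http://|https://)(www\.)*.*?(?=(www\.|http://|https://|$)))</code> over the sample string from the question. Because the pattern contains capturing groups, <code>re.findall</code> returns one tuple per match, so the sketch keeps only the first group (the whole URL). The names <code>pattern</code> and <code>urls</code> are illustrative, not part of the original answer.

```python
import re

# Sample string from the question, split across lines only for readability.
strings = ("http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhs"
           "http://httpget.org/get.zipwww.google.com/privacy.html"
           "https://goodurl.net/")

# Lazy match that stops right before the next "www.", "http://", "https://"
# or the end of the string.
pattern = r"((www\.|http://|https://)(www\.)*.*?(?=(www\.|http://|https://|$)))"

# findall yields a tuple of all groups per match; group 1 is the full URL.
urls = [groups[0] for groups in re.findall(pattern, strings)]

for url in urls:
    print(url)
```

Note that this pattern also cuts at every <code>www.</code>, so it would split a single URL that happens to contain <code>www.</code> in its path.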


Your problem is that <code>http://</code> is being accepted as a valid part of a URL. This is because of this token right here:

<pre><code>[$-_@.&amp;+]
</code></pre>

or more specifically:

<pre><code>$-_
</code></pre>

This matches every character in the range from <code>$</code> to <code>_</code>, which covers far more characters than you probably intended, including <code>:</code> and <code>/</code>.
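To see how wide that range is, the following small sketch lists the ASCII characters between <code>$</code> and <code>_</code> and compares the unescaped class with an escaped one. The variable names are illustrative only.

```python
import re

# Every character from "$" (0x24) up to "_" (0x5F), inclusive.
covered = [chr(code) for code in range(ord("$"), ord("_") + 1)]
print("".join(covered))

# Unescaped hyphen: "$-_" is a range, so ":" and "/" are accepted.
range_accepts_colon = re.fullmatch(r"[$-_]", ":") is not None
range_accepts_slash = re.fullmatch(r"[$-_]", "/") is not None

# Escaped hyphen: only the three literal characters "$", "-" and "_" match.
literal_accepts_colon = re.fullmatch(r"[$\-_]", ":") is not None

print(range_accepts_colon, range_accepts_slash, literal_accepts_colon)
```

Since both <code>:</code> and <code>/</code> fall inside the range, the original pattern can run straight through every embedded <code>http://</code>.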

You can escape the hyphen, giving <code>[$\-_@.&amp;+]</code>, but then <code>/</code> characters no longer match. So add the slash explicitly: <code>[$\-_@.&amp;+/]</code>. However, this is still not enough, since <code>http://example.com/path/topage.htmlhttp</code> would be considered a valid match.

The final step is to add a negative lookahead to ensure that you are not matching <code>http://</code> or <code>https://</code> inside a URL, which just so happens to be the first part of your regex!

<pre><code>http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_@.&amp;+/]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
</code></pre>

Tested <a href="https://regex101.com/r/sF0oM0/1" rel="nofollow">here</a>.
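For reference, a minimal sketch of the pattern with the negative lookahead, run from Python on the question's sample string. The names <code>pattern</code> and <code>links</code> are illustrative. Note that this pattern only breaks at <code>http://</code> or <code>https://</code>, so the scheme-less <code>www.google.com/privacy.html</code> stays attached to the URL before it.

```python
import re

# Sample string from the question, split across lines only for readability.
strings = ("http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhs"
           "http://httpget.org/get.zipwww.google.com/privacy.html"
           "https://goodurl.net/")

# The negative lookahead (?!http[s]?://) stops the letter alternative from
# consuming the "h" that starts the next embedded URL.
pattern = (r"http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_@.&+/]|[!*,]"
           r"|(?:%[0-9a-fA-F][0-9a-fA-F]))+")

# No capturing groups, so findall returns the full matches as strings.
links = re.findall(pattern, strings)

for link in links:
    print(link)
```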


A simple approach without much complication:

<pre><code>import re

# l is the input string holding the concatenated URLs
url_list = []

# split on "http://" first, then split each piece on "https://"
for x in re.split("http://", l):
    url_list.append(re.split("https://", x))

# flatten the list of lists into a single list
url_list = [item for sublist in url_list for item in sublist]
</code></pre>

The split removes the <code>http://</code> and <code>https://</code> prefixes. If you want them back on the URLs, adjust the code to prepend them. I hope this conveys the idea.
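One possible way to keep the scheme, sketched under the assumption that URLs only ever start with <code>http://</code> or <code>https://</code>: when the split pattern is wrapped in a capturing group, <code>re.split</code> also returns the separators, so each scheme can be glued back onto the text that follows it. The helper name <code>split_urls</code> is hypothetical and not part of the answer above.

```python
import re


def split_urls(text):
    """Split concatenated URLs while keeping each http:// or https:// prefix."""
    # With a capturing group, re.split returns:
    # [before_first_scheme, scheme, rest, scheme, rest, ...]
    parts = re.split(r"(https?://)", text)
    schemes = parts[1::2]   # every captured separator
    rests = parts[2::2]     # the text following each separator
    return [scheme + rest for scheme, rest in zip(schemes, rests)]


example = "http://a.example/xhttps://b.example/"
result = split_urls(example)
print(result)
```

Any text before the first scheme is dropped, and scheme-less addresses such as <code>www.google.com</code> are not separated by this sketch.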


Here's mine:

<pre><code>(r'http[s]?://[a-zA-Z]{3}\.[a-zA-Z0-9]+\.[a-zA-Z]+')
</code></pre>
</code></pre>
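A short sketch of what the pattern <code>http[s]?://[a-zA-Z]{3}\.[a-zA-Z0-9]+\.[a-zA-Z]+</code> accepts: it requires exactly three letters before the first dot (for example <code>www</code>) and stops after the top-level domain, so paths are not captured. The input <code>sample</code> is an assumed example, not taken from the thread.

```python
import re

pattern = r"http[s]?://[a-zA-Z]{3}\.[a-zA-Z0-9]+\.[a-zA-Z]+"

# Assumed example input with a three-letter subdomain.
sample = "see http://www.google.com/privacy.html for details"
matches = re.findall(pattern, sample)
print(matches)

# The question's string has no three-letter label right after the scheme.
strings = ("http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhs"
           "http://httpget.org/get.zipwww.google.com/privacy.html"
           "https://goodurl.net/")
question_matches = re.findall(pattern, strings)
print(question_matches)
```

On the question's own string this pattern finds nothing, so it suits only inputs of the <code>http://www.name.tld</code> shape.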

I have a string like this:

<blockquote>
 <code>http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/</code>
</blockquote>

I would like to extract all URLs / web addresses into an array, for example:

<code>urls = ['http://example.com/path/topage.html','http://twitter.com/p/xyan',.....]</code>

Here is my approach, which didn't work:

<pre><code>import re
strings = "http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/"
links = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&amp;+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strings)

print(links)
# the result is always the same as strings
</code></pre>

Regex to extract all urls from string
