文章/答案/技术大牛

发布

社区首页 >问答首页 >当给定起始字符时，搜索速度较慢是违反直觉的

问当给定起始字符时，搜索速度较慢是违反直觉的
EN

Stack Overflow用户

提问于 2010-09-18 19:48:28

回答 4查看 1.9K关注 0票数 5

我已经编写了一个Python实用程序来扫描日志文件中的已知错误模式。

我试图通过向regex引擎提供额外的模式信息来加快搜索速度。例如，我不仅要查找带有gold的行，还要求这样的行必须以下划线开头，因此：^_.*gold而不是gold。

由于99%的行不是以下划线开头，因此我期望获得很大的性能回报，因为正则表达式引擎可能会中止读取一个字符之后的行。我很惊讶地发现了另一种方式。

下面的程序说明了这个问题：

import re
from time import time
def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"^_") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = re.search(p, line)
        end = time() 
        print 'Elapsed: ' + str(end-start) 
main()

我试着回顾一下sre_compile.py以寻求解释，但它的代码对我来说太复杂了。

观察到的性能是否可以解释为，行字符开始的包含将正则表达式引擎操作从简单的子串op转换为更复杂的回溯状态机op？从而超过了任何好处，比如在第一个字符之后中止搜索？

考虑到这一点，我试着将线的长度乘以x8，期望线搜索的开始会大放异彩，但结果差距只会变得更大(22秒Vs 6秒)。

我很困惑:哦，我是不是漏掉了什么？

regex

performance

python

回答 4

Stack Overflow用户

回答已采纳

发布于 2010-09-18 20:53:15

您实际上做错了两件事:如果您希望在字符串的开头也使用match not search.，那么不要使用re.match( pattern, line)，而是编译模式并使用pattern.match(line)。

import re
from time import time
def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = p.match(line)
        end = time() 
        print 'Elapsed: ' + str(end-start) 
main()

您将看到您现在有了预期的行为-这两种模式花费的时间完全相同。

票数 2

Stack Overflow用户

发布于 2010-09-18 19:51:29

怎么样

if line[0] == "_" and "gold" in line:
   print "Yup, it starts with an underscore"
else:
   print "Nope it doesn't"

说真的，不要过度使用正则表达式

票数 2

Stack Overflow用户

发布于 2010-09-18 21:03:02

有趣的观察！我已经用过它了。我的猜测是，regexp引擎将扫描整个字符串以查找下划线，并在找到匹配项后将其与开头的一行进行匹配。这可能与使用re.MULTILINE时的统一行为有关

如果您使用re.match而不是re.search作为下划线模式，则两者似乎都是一样快的，即

def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_.*") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    start = time()
    for i in xrange(1000*1000):
        match = re.match(p1, line)
    end = time() 
    print 'Elapsed: ' + str(end-start) 
    start = time()
    for i in xrange(1000*1000):
        match = re.search(p2, line)
    end = time() 
    print 'Elapsed: ' + str(end-start)

在这种情况下，match将要求match在字符串的开头开始匹配。

此外，请注意，以下预编译模式的使用似乎更快：

for p in patterns:
    start = time()
    for i in xrange(1000*1000):
        match = p.search(line)
    end = time() 
    print 'Elapsed: ' + str(end-start)

但速度上的差异仍然存在。

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/3741581

复制

相似问题

问当给定起始字符时，搜索速度较慢是违反直觉的
EN

回答 4

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问当给定起始字符时，搜索速度较慢是违反直觉的EN

回答 4

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问当给定起始字符时，搜索速度较慢是违反直觉的
EN