Q: Trim big log file

Stack Overflow user

Asked 2012-03-11 16:25:18

Answers: 4, Views: 2.1K, Followers: 0, Votes: 7

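I run performance tests for several Java applications. During a test the applications produce very large log files (7-10 GB). I need to trim these log files to the entries between specific dates and times. Currently I use a Python script that parses each log timestamp into a Python datetime object and prints only the matching lines. But this solution is very slow: a 5 GB log takes about 25 minutes to parse. The entries in the log file are obviously in chronological order, so I should not need to read the whole file line by line. I thought about reading the file from the start and from the end until the conditions are matched, and then printing the part of the file between the matched lines. But I don't know how to read a file backwards without loading it into memory.

Could you suggest a suitable solution for this case?

Here is part of the Python script: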

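      # Excerpt: assumes `from datetime import datetime`; `str` and `end` are the
      # datetime bounds of the range (note that `str` shadows the built-in name).
      lfmt = '%Y-%m-%d %H:%M:%S'
      file = open(filename, 'r')
      normal_line = ''
      for line in file:
        if line[0] == '[':
          ltimestamp = datetime.strptime(line[1:20], lfmt)

          if ltimestamp >= str and ltimestamp <= end:
            normal_line = 'True'
          else:
            normal_line = ''

        # lines without a timestamp (e.g. stack traces) follow the last decision
        if normal_line:
          print(line, end='')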

python

performance

bash

sed

logging

4 Answers

Stack Overflow user

Accepted answer

Answered 2012-03-11 17:19:24

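Since the data is sequential, reading from the end of the file (to find the matching end point) is still a bad solution if both the start and the end of the region of interest are near the beginning of the file!

I have written some code that quickly finds the start and end points you need. The approach is called binary search, and it resembles the classic children's "higher or lower" guessing game.

The script reads a trial line midway between lower_bound and upper_bound (initially the start and the end of the file) and checks it against the match criteria. If the line being sought is earlier, it guesses again by reading a line halfway between lower_bound and the previous trial; if the line being sought is later, it splits the range between the previous trial and upper_bound. You keep iterating between the lower and upper bounds, which gives the fastest solution "on average".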

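This should be a really quick solution: the number of reads grows with the base-2 logarithm of the number of lines. For example, in the worst case (finding line 999 of 1000), binary search needs only about 10 line reads, and finding a line among a billion takes only about 30.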
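The logarithmic growth can be checked with a line of integer arithmetic. The sketch below is only an illustration: `max_probes` is a name introduced here, and it counts probes for a classic binary search over sorted items, so a search that bisects byte offsets, as the script in this answer does, may need somewhat more.

```python
def max_probes(line_count):
    """Worst-case probes for a classic binary search over line_count sorted items."""
    # Each probe halves the remaining candidates: floor(log2(n)) + 1 probes,
    # which for a positive integer equals its bit length.
    return line_count.bit_length()

for n in (1000, 10**9):
    print(n, max_probes(n))
```

For 1,000 and 1,000,000,000 lines this gives 10 and 30 probes respectively.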

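Assumptions made by the code below:

Every line starts with time information.
The times are unique. If they are not, then once a match is found you will have to check backwards or forwards, as appropriate, to include or exclude all entries that share the matching time.
Amusingly, this is a recursive function, so with Python's default recursion limit of 1000 the file is limited to about 2**1000 lines (luckily, that allows for quite a large file...)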
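If the timestamps are not unique, one way to avoid stepping backwards or forwards after a hit is to search for a boundary instead of an exact match: the first line that satisfies a condition. The following is a minimal iterative sketch, not the code from this answer. It assumes a file opened in binary mode, lines of the form `[YYYY-mm-dd HH:MM:SS] ...` with timestamps that never decrease, and it compares the fixed-width timestamp bytes directly. The names `line_start_at_or_after` and `first_offset_where` are hypothetical.

```python
import io
import os

def line_start_at_or_after(f, pos):
    """Byte offset of the first line that starts at or after pos."""
    if pos == 0:
        return 0
    f.seek(pos - 1)
    f.readline()              # finish the line that contains byte pos - 1
    return f.tell()

def first_offset_where(f, is_late_enough):
    """Offset of the first line whose timestamp satisfies is_late_enough.

    Returns the file size if no line qualifies.
    """
    lo, hi = 0, f.seek(0, os.SEEK_END)
    while lo < hi:
        mid = (lo + hi) // 2
        start = line_start_at_or_after(f, mid)
        f.seek(start)
        line = f.readline()
        if not line or is_late_enough(line[1:20]):
            hi = mid          # the boundary is at or before mid
        else:
            lo = start + 1    # every offset up to this line is too early
    return line_start_at_or_after(f, lo)

log = io.BytesIO(
    b"[2012-02-01 13:09:59] a\n"
    b"[2012-02-01 13:10:00] b\n"
    b"[2012-02-01 13:10:00] c\n"
    b"[2012-02-01 13:39:00] d\n"
    b"[2012-02-01 13:40:00] e\n"
)
begin = first_offset_where(log, lambda ts: ts >= b"2012-02-01 13:10:00")
end = first_offset_where(log, lambda ts: ts > b"2012-02-01 13:39:00")
log.seek(begin)
selected = log.read(end - begin)
print(selected.decode(), end="")
```

Searching once with `>=` for the start and once with `>` for the end keeps every entry that shares a boundary timestamp, and the loop avoids the recursion limit altogether.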

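Further:

This could be adapted to read in arbitrary blocks rather than by line, if preferred, as suggested by J.F. Sebastian.
In my original answer I suggested this approach but using linecache.getline. While that works, it is unsuitable for large files because it reads the whole file into memory (so file.seek() is better); thanks to TerryE and J.F. Sebastian for pointing that out.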

import datetime

LFMT = '%Y-%m-%d %H:%M:%S'

def match(line):
    # parse the timestamp of a line such as b'[2012-02-01 13:10:00] ...'
    if line.startswith(b'['):
        return datetime.datetime.strptime(line[1:20].decode('ascii'), LFMT)

def retrieve_test_line(position):
    file.seek(position, 0)
    file.readline()  # avoids reading partial line, which will mess up match attempt
    new_position = file.tell()  # gets start of line position
    return file.readline(), new_position

def check_lower_bound(position):
    file.seek(position, 0)
    new_position = file.tell()  # gets start of line position
    return file.readline(), new_position

def find_line(target, lower_bound, upper_bound):
    trial = (lower_bound + upper_bound) // 2
    inspection_text, position = retrieve_test_line(trial)
    if position == upper_bound:
        text, position = check_lower_bound(lower_bound)
        if match(text) == target:
            return position
        return  # no match for target within range
    matched_position = match(inspection_text)
    if matched_position is None:
        return  # line without a timestamp: no match for target within range
    if matched_position == target:
        return position
    elif matched_position < target:
        return find_line(target, position, upper_bound)
    else:
        return find_line(target, lower_bound, position)

# start_target: timestamp of the first line you are trying to find
start_target = datetime.datetime.strptime("2012-02-01 13:10:00", LFMT)
# end_target: timestamp of the last line you are trying to find
end_target = datetime.datetime.strptime("2012-02-01 13:39:00", LFMT)
file = open("log_file.txt", "rb")  # binary mode, so seek()/tell() use byte offsets
lower_bound = 0
file.seek(0, 2)  # find upper bound
upper_bound = file.tell()

sequence_start = find_line(start_target, lower_bound, upper_bound)
sequence_end = None

if sequence_start is not None:  # offset 0 is a valid match - corner case
    sequence_end = find_line(end_target, sequence_start, upper_bound)
    if sequence_end is None:
        print("start_target match: ", sequence_start)
        print("end match is not present in the current file")
else:
    print("start match is not present in the current file")

if sequence_start is not None and sequence_end is not None:
    print("start_target match: ", sequence_start)
    print("end_target match: ", sequence_end)
    print()
    print(start_target, 'target')
    file.seek(sequence_start, 0)
    print(file.readline().decode())
    print(end_target, 'target')
    file.seek(sequence_end, 0)
    print(file.readline().decode())
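Once both offsets are known, the selected region can be written out without parsing any further timestamps. A minimal sketch, assuming binary-mode file objects and byte offsets like those returned by find_line; the helper name `copy_range` and the chunk size are illustrative. The range is half-open, so to include the matched end line you would pass the offset just past that line.

```python
import io

def copy_range(src, dst, start, end, chunk_size=1 << 20):
    """Copy bytes [start, end) from src to dst in fixed-size chunks."""
    src.seek(start)
    remaining = end - start
    while remaining > 0:
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:             # unexpected end of file
            break
        dst.write(chunk)
        remaining -= len(chunk)

src = io.BytesIO(b"line 1\nline 2\nline 3\nline 4\n")
dst = io.BytesIO()
copy_range(src, dst, 7, 21, chunk_size=4)   # tiny chunks just for the demo
print(dst.getvalue())
```

Copying in chunks keeps memory use bounded even when the selected region is several gigabytes.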

Votes: 5

Stack Overflow user

Answered 2012-03-12 01:10:04

"a 5 GB log takes about 25 minutes to parse"

That is ~3 MB/s. Even a sequential O(n) scan in Python can do much better (~500 MB/s for wc-l.py); the performance should be limited only by I/O.
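One likely reason a line-by-line scan falls so far short of I/O speed is the datetime.strptime call on every line. Because the timestamps in the question are fixed-width and zero-padded, comparing the raw strings gives the same ordering as comparing the parsed dates. The sketch below is an assumption-laden illustration, not a measured fix: `trim_log` is a name introduced here, and it assumes entries start with `[YYYY-mm-dd HH:MM:SS` and are in chronological order.

```python
import io

def trim_log(lines, start, end):
    """Yield the entries whose timestamp lies in [start, end] (string bounds)."""
    keep = False
    for line in lines:
        if line.startswith("["):
            stamp = line[1:20]
            if stamp > end:
                return        # entries are ordered, so nothing later can match
            keep = stamp >= start
        # lines without a timestamp (e.g. stack traces) follow the previous entry
        if keep:
            yield line

sample = io.StringIO(
    "[2012-02-01 13:09:59] early\n"
    "[2012-02-01 13:10:00] first\n"
    "  at Foo.bar(Foo.java:1)\n"
    "[2012-02-01 13:39:00] last\n"
    "[2012-02-01 13:40:00] late\n"
)
result = list(trim_log(sample, "2012-02-01 13:10:00", "2012-02-01 13:39:00"))
print("".join(result), end="")
```

The early return also stops the scan as soon as the range has been passed, which helps when the region of interest is near the start of the file.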

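To perform a binary search on the file, you could adapt FileSearcher, which uses fixed-size records, to use lines instead, with an approach similar to the tail -n implementation in Python (scanning for '\n' is O(n)).

To avoid O(n) (if the date range selects only a small portion of the log), you could use an approximate search that works on large fixed-size chunks and accepts that some records may be missed because they lie on a chunk boundary. For example, use the unmodified FileSearcher with record_size=1MB and a custom Query class: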

class Query(object):

    def __init__(self, query):
        self.query = query # e.g., '2012-01-01'

    def __lt__(self, chunk):
        # assume line starts with a date; find the start of line
        i = chunk.find('\n')
        # assert '\n' in chunk and len(chunk) > (len(self.query) + i)
        # e.g., '2012-01-01' < '2012-03-01'
        return self.query < chunk[i+1:i+1+len(self.query)]

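To take into account that the date range can span multiple chunks, you could modify FileSearcher.__getitem__ to return (filepos, chunk) and search twice (bisect_left(), bisect_right()) to find the approximate filepos_mindate and filepos_maxdate. After that, you could perform a linear search around those file positions (e.g., using the tail -n approach) to find the exact first and last log records.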
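The two-stage idea (bisect over fixed-size chunks, then a short linear scan) can be sketched with the standard bisect module. This is not the FileSearcher recipe: `ChunkIndex` and `first_line_at_or_after` are hypothetical stand-ins, and the sketch assumes a binary-mode file whose lines all start with `[YYYY-mm-dd HH:MM:SS` in non-decreasing order.

```python
import bisect
import io
import os

class ChunkIndex:
    """Sequence view of a log: item i is the timestamp of the first line that
    starts after byte i * chunk_size (item 0 is the first line of the file)."""

    def __init__(self, f, chunk_size):
        self.f = f
        self.chunk_size = chunk_size
        self.size = f.seek(0, os.SEEK_END)

    def __len__(self):
        return -(-self.size // self.chunk_size)    # ceiling division

    def __getitem__(self, i):
        self.f.seek(i * self.chunk_size)
        if i > 0:
            self.f.readline()                      # drop the line we landed in
        line = self.f.readline()
        # b"\xff" sorts after any ASCII timestamp, so end of file counts as "late"
        return line[1:20] if line else b"\xff"

def first_line_at_or_after(f, target, chunk_size=1 << 20):
    """Offset of the first line with timestamp >= target (file size if none)."""
    chunk = bisect.bisect_left(ChunkIndex(f, chunk_size), target)
    # The exact line may start in the previous chunk, so scan from there.
    pos = max(chunk - 1, 0) * chunk_size
    f.seek(pos)
    if pos > 0:
        f.readline()
    while True:
        pos = f.tell()
        line = f.readline()
        if not line or line[1:20] >= target:
            return pos

log = io.BytesIO(
    b"[2012-02-01 13:09:59] a\n"
    b"[2012-02-01 13:10:00] b\n"
    b"[2012-02-01 13:10:00] c\n"
    b"[2012-02-01 13:39:00] d\n"
    b"[2012-02-01 13:40:00] e\n"
)
print(first_line_at_or_after(log, b"2012-02-01 13:10:00", chunk_size=50))
```

With 1 MB chunks the bisection touches only a few dozen positions even in a 10 GB file, and the final linear scan reads at most about two chunks' worth of lines, provided individual lines are short compared with the chunk size.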

Votes: 2

Stack Overflow user

Answered 2012-03-11 16:37:45

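7 to 10 GB is a large amount of data. If I had to analyze data of this kind, I would either have the applications log to a database or load the log files into a database. You can then run a wide range of analyses efficiently on the database. If you are using a standard logging tool such as Log4J, logging to a database should be quite simple. Just suggesting an alternative solution.

For more on database logging, see this post: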

A good database log appender for Java?

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/9653507
