Q: Optimizing a search of two text files and producing output based on a third text file, using Python

Stack Overflow user

Asked 2018-07-20 05:11:16

Answers: 1 | Views: 20 | Followers: 0 | Votes: 0

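I'm running into a performance problem with a Python function. It loads two 5+ GB tab-delimited txt files that share the same format but hold different values, and it uses a third text file as a key to decide which values should be kept for output. I'd like some help improving the speed, if possible.

Here is the code: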

import csv


def rchfile():
    # there are 24752 text lines per stress period, 520 columns, 476 rows
    # there are 52 lines per MODFLOW model row
    lst = []
    out = []
    tcel = 0
    end_loop_break = False

    # key file that will set which file values to use. If cell address is not present or value of cellid = 1 use
    # baseline.csv, otherwise use test_p97 file.
    with open('input/nrd_cells.csv') as csvfile:
        reader = csv.reader(csvfile)
        for item in reader:
            lst.append([int(item[0]), int(item[1])])

    # two files that are used for data
    with open('input/test_baseline.rch', 'r') as b, open('input/test_p97.rch', 'r') as c:
        for x in range(3):  # skip the first 3 lines that are the file header
            b.readline()
            c.readline()

        while True:  # loop until end of file, this should loop here 1,025 times
            if end_loop_break:
                break
            for x in range(2):  # skip the first 2 lines that are the stress period header
                b.readline()
                c.readline()

            for rw in range(1, 477):
                if end_loop_break:
                    break

                for cl in range(52):
                    # read both files at the same time to get the different data and split the 10 values in the row
                    b_row = b.readline().split()
                    c_row = c.readline().split()

                    if not b_row:
                        end_loop_break = True  # assignment; the original used == here, so the flag was never set
                        break

                    for x in range(1, 11):
                        # search for the cell address in the key file to find which file's data to keep
                        testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]

                        if not testval:  # cell address not in key file
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 1:  # cell address value == 1
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 2:  # cell address value == 2
                            out.append(c_row[x - 1])

                        print(cl * 10 + x + tcel)  # test output for cell location

                tcel += 520

    print('success')

The key file looks like this:

37794, 1
37795, 0
37796, 2

Each data file is large, about 5 GB, and complicated from a counting standpoint, but the format is standard and looks like this:

0    0    0    0    0    0    0    0    0    0
1.5  1.5  0    0    0    0    0    0    0    0

This process takes a very long time, and I'm hoping someone can help speed it up.

python

python-3.x

csv

1 Answer

Stack Overflow user

Accepted answer

Answered 2018-07-20 05:32:29

I believe your speed problem comes from this line:

testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]

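For every value in your large output, you are iterating over the entire key list, so the cost of each lookup grows with the number of keys. That is not great.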
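
To illustrate the difference, here is a minimal sketch with synthetic key data (the addresses and flags below are made up for the example, and `find_by_scan` / `find_by_dict` are hypothetical helper names). Both helpers return the same flag, but the first walks every entry of the list while the second does a single hash lookup.

```python
# Synthetic key data for illustration only; the real keys come from nrd_cells.csv.
key_list = [[addr, 1 + addr % 2] for addr in range(1, 5001)]
key_dict = {addr: flag for addr, flag in key_list}


def find_by_scan(keys, address):
    """Mirror the question's approach: scan the whole list for one address."""
    hits = [i for i, entry in enumerate(keys) if entry[0] == address]
    return keys[hits[0]][1] if hits else None


def find_by_dict(keys, address):
    """Hash lookup: does not walk the other entries."""
    return keys.get(address)
```

With the scan, every one of the values in the data files pays for a full pass over the key list; with the dict, the per-value cost stays roughly constant regardless of how many keys there are.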

It looks like cl * 10 + x + tcel is the formula for the value you are looking for in lst[n][0].

I suggest using a dict instead of a list to store the data in lst.

lst = {}
for item in reader:
   lst[int(item[0])] = int(item[1])

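Now lst is a mapping, which means you can simply use the in operator to check whether a key exists. This is a near-instant lookup, because the dict type is hash-based and very efficient for key lookups.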

something in lst
# for example
(cl * 10 + x + tcel) in lst

You can get the value with:

lst[something]
# or
lst[cl * 10 + x + tcel]
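
Putting the membership test and the value lookup together, here is a minimal sketch built from the three sample key lines in the question. The helper name `pick_value` is hypothetical. Returning `None` for flags other than 1 or 2 is an assumption that mirrors the original branches, which append nothing in that case.

```python
import csv
import io

# The three sample key lines from the question, read the same way as the key file.
sample_key_file = io.StringIO("37794, 1\n37795, 0\n37796, 2\n")
lst = {}
for item in csv.reader(sample_key_file):
    lst[int(item[0])] = int(item[1])  # int() tolerates the space after the comma


def pick_value(address, baseline_value, p97_value):
    """Hypothetical helper: choose which file's value to keep for one cell."""
    flag = lst.get(address, 1)  # a missing address behaves like flag 1
    if flag == 1:
        return baseline_value
    if flag == 2:
        return p97_value
    return None  # any other flag: the original code appends nothing
```

Using `dict.get` with a default folds the "address not in key file" branch into the "flag is 1" branch, so each cell costs one lookup instead of a membership test plus an index.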

With a little refactoring, your code should speed up considerably.

Votes: 0
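
One possible shape for that refactoring is sketched below. It is not a drop-in replacement. The generator name `merged_values` and its parameters are hypothetical, and it takes already-open files so it can be tried on small in-memory data. It keeps the question's addressing (`cl * 10 + x + tcel`, with `tcel` never reset between stress periods), assumes 10 values per line, and omits the per-cell `print`.

```python
import io


def merged_values(key, b, c, rows=476, lines_per_row=52,
                  header_lines=3, period_header_lines=2):
    """Yield the value to keep for each cell, reading both files in lockstep.

    key maps cell address -> flag (1 = baseline, 2 = p97); b and c are open
    text files with the layout described in the question.
    """
    for f in (b, c):
        for _ in range(header_lines):  # skip the file header
            f.readline()
    tcel = 0
    while True:
        for f in (b, c):
            for _ in range(period_header_lines):  # skip the stress period header
                f.readline()
        for _ in range(rows):
            for cl in range(lines_per_row):
                b_row = b.readline().split()
                c_row = c.readline().split()
                if not b_row:  # end of file (or a blank line), as in the original
                    return
                for x, (b_val, c_val) in enumerate(zip(b_row, c_row), start=1):
                    flag = key.get(cl * 10 + x + tcel, 1)  # missing address -> baseline
                    if flag == 1:
                        yield b_val
                    elif flag == 2:
                        yield c_val
            tcel += lines_per_row * 10


# Tiny in-memory demo (2 rows, 1 line per row) standing in for the real .rch files.
header = "h\n" * 3 + "p\n" * 2
baseline = io.StringIO(header + " ".join(["0"] * 10) + "\n" + " ".join(["1"] * 10) + "\n")
p97 = io.StringIO(header + " ".join(["9"] * 10) + "\n" + " ".join(["8"] * 10) + "\n")
demo = list(merged_values({1: 2, 12: 2}, baseline, p97, rows=2, lines_per_row=1))
```

Because the values are yielded one at a time, the caller can write them out as they arrive instead of holding the whole output in a list, which matters with inputs of this size.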

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/51431769
