I have two CSV files, each about 3 GB in size, and I want to compare them and store the differences in a third file.
Python code:
with open('JUN-01.csv', 'r') as f1:
    file1 = f1.readlines()
with open('JUN-02.csv', 'r') as f2:
    file2 = f2.readlines()

with open('JUN_Updates.csv', 'w') as outFile:
    outFile.write(file1[0])
    for line in file2:
        if line not in file1:
            outFile.write(line)
Execution time: 45 minutes and still running...
Posted on 2018-07-26 02:00:41
Not sure if this comes too late, but here it is.
I see you are loading both files fully into memory as two arrays. If they are about 3 GB each as you say, that means trying to fit 6 GB into RAM, likely pushing you into swap.
Moreover, even if you do manage to load the files, you would be attempting roughly L1 x L2 string comparisons (L1 and L2 being the line counts).
I ran the following code against a 1.2 GB file (3.3 million lines) and it completed in a matter of seconds. It uses string hashing and only loads a set of L1 int32 values into memory.
The trick here is to build a set() after applying the hashstring function to each line of the file (except the heading, which you seem to be adding to the output):
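To make the cost concrete, here is a minimal sketch (with small made-up inputs) of why the original loop is quadratic: `line not in file1` scans the whole list on every check, while a set lookup is a single hash probe on average.

```python
# Stand-ins for the two files (hypothetical data, not the real CSVs).
lines_a = [f'row-{i}\n' for i in range(1000)]        # plays the role of file1
lines_b = [f'row-{i}\n' for i in range(500, 1500)]   # plays the role of file2

# Original approach: each lookup scans the list -> O(len(a) * len(b)) total.
diff_list = [line for line in lines_b if line not in lines_a]

# Set approach: each lookup is O(1) on average -> O(len(a) + len(b)) total.
seen = set(lines_a)
diff_set = [line for line in lines_b if line not in seen]

assert diff_list == diff_set  # identical result, vastly cheaper at 3 GB scale
```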
file1 = set(map(hashstring, f1))
Please note I am comparing the file against itself (f2 loads the same file as f1). Let me know if it helps.
from zlib import adler32

def hashstring(s):
    return adler32(s.encode('utf-8'))

with open('haproxy.log.1', 'r') as f1:
    heading = f1.readline()
    print(f'Heading: {heading}')
    print('Hashing')
    file1 = set(map(hashstring, f1))
    print(f'Hashed: {len(file1)}')

with open('updates.log', 'w') as outFile:
    count = 0
    outFile.write(heading)
    with open('haproxy.log.1', 'r') as f2:
        f2.readline()  # skip the heading; it was already written above
        for line in f2:
            if hashstring(line) not in file1:
                outFile.write(line)
                count += 1
                if 0 == count % 10000:
                    print(f'Checked: {count}')
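The same hash-set technique can be seen end to end on a miniature example (the two "files" below are hypothetical in-memory stand-ins, not the real logs). Each line of day one is reduced to one 32-bit integer, and day two is streamed against that set:

```python
from io import StringIO
from zlib import adler32

def hashstring(s):
    return adler32(s.encode('utf-8'))

# Hypothetical miniature versions of the two files, heading included.
day1 = StringIO('id,value\n1,a\n2,b\n3,c\n')
day2 = StringIO('id,value\n1,a\n2,B\n3,c\n4,d\n')

heading = day1.readline()
seen = set(map(hashstring, day1))  # one int per line instead of the line itself

out = StringIO()
out.write(day2.readline())  # copy the heading through, as in the answer above
for line in day2:
    if hashstring(line) not in seen:
        out.write(line)

# out now holds the heading plus only the changed/added rows.
# Caveat: adler32 is a 32-bit checksum, so two distinct lines can collide,
# in which case a genuinely new line would be dropped silently.
```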
Posted on 2018-06-04 18:55:31
If difflib is efficient enough to help, try the following approach:
import difflib

with open('JUN_Updates.csv', 'w') as differenceFile:
    with open('JUN-01.csv', 'r') as june1File:
        with open('JUN-02.csv', 'r') as june2File:
            diff = difflib.unified_diff(
                june1File.readlines(),
                june2File.readlines(),
                fromfile='june1File',
                tofile='june2File',
            )
    lines = list(diff)[2:]
    added = [line[1:] for line in lines if line[0] == '+']
    removed = [line[1:] for line in lines if line[0] == '-']
    for line in added:
        differenceFile.write(line)
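The slicing in that answer is easier to follow on a tiny example (made-up lists below): `unified_diff` yields the two `---`/`+++` header lines first, which `[2:]` drops; hunk headers start with '@@' and context lines with a space, so testing the first character cleanly separates additions from removals.

```python
import difflib

# Hypothetical miniature file contents.
old = ['id,value\n', '1,a\n', '2,b\n']
new = ['id,value\n', '1,a\n', '2,B\n', '3,c\n']

diff = difflib.unified_diff(old, new, fromfile='old', tofile='new')
lines = list(diff)[2:]  # drop the '--- old' / '+++ new' header pair

# '@@' hunk headers and ' '-prefixed context lines fail both tests below.
added = [line[1:] for line in lines if line[0] == '+']
removed = [line[1:] for line in lines if line[0] == '-']
```

Note that this holds whole files in memory via `readlines()`, so for the 3 GB inputs in the question it has the same RAM problem as the original code.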
https://stackoverflow.com/questions/50678710