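So, I am looking for a tool that can compare the files in a folder based on checksums (this is pretty common and not hard to find). However, in my use case the files can live in fairly deep folder paths that can change, and I want to run the comparison every few months and create a package containing only the files that differ. I don't care which folder a file is in: the same file can regularly move between folders, and files never change name, only content (so a checksum is a must).
My problem is that almost every tool I can find cares about folder paths when comparing folders. I don't; in fact, I want the folder path ignored. I'd rather not develop anything, or at least only develop a small part of the process, to save time.
To be clear, the order of events I want is: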
Program scans directory from 1/1/2020 (A).
Program scans directory from 4/1/2020 (B).
Finds all checksums in B that don't exist in A and makes a new folder with the files that are different (C).
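Any ideas? Also, this only needs to happen once every 4 months and only covers about 47 to (32,000 files). If it runs for 18 hours straight, that is totally fine. I just need it to work.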
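Posted on 2020-03-18 09:30:15
I would suggest using a hash that is more sophisticated than a simple checksum, to avoid the possibility of hash collisions. Possible hashes are SHA-1, SHA-256, etc.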
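To illustrate the difference, here is a minimal sketch using only the standard library. `zlib.crc32` produces a 32-bit checksum, while `hashlib.sha256` produces a 256-bit digest, so accidental collisions across tens of thousands of files are far less of a concern with the latter. The helper names and the sample byte strings are made up for illustration.

```python
import hashlib
import zlib

def crc32_of(data: bytes) -> int:
    """Simple 32-bit checksum: only about 4.3 billion possible values."""
    return zlib.crc32(data)

def sha256_of(data: bytes) -> str:
    """256-bit cryptographic hash, returned as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical contents of the "same" file at two points in time
old = b"report contents, January"
new = b"report contents, April"

print("CRC-32 :", crc32_of(old), crc32_of(new))
print("SHA-256:", sha256_of(old))
print("SHA-256:", sha256_of(new))
print("Digest size in bits:", len(sha256_of(new)) * 4)
```

Either function detects a changed file in the common case; the longer digest simply makes it much less likely that two different files are mistaken for the same one.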
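You can do this on just about any platform with a few lines of Python using the built-in libraries, specifically os.walk to traverse your directory structure and hashlib to calculate the hashes. You could even use zipfile to create a compressed archive of the new/changed files. Personally, I would go with something like: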
#!python
# The following code assumes Python 3.8 or higher
import os
import datetime
import pickle  # used to store the set of hashes between runs
import hashlib

CHUNK_SIZE = 1024*1024  # A megabyte at a time, adjust if necessary
TREE_ROOT = "/top/of/tree"  # Wherever that is
SHA_FILE = "/some/other/path/tree_shas.pickle"  # Adjust as needed

def hashfile(filepath):
    """ Calculate the hash of a single file """
    with open(filepath, 'rb') as infile:
        sha = hashlib.sha256()
        while chunk := infile.read(CHUNK_SIZE):  # This will only work for Python >= 3.8
            sha.update(chunk)
    return sha.digest()
# The above tested with a 12 MB file and took 39 msecs on my laptop

def check_tree(startfrom, last_shas):
    """ Check the contents of a tree against the sha values in the last_shas set """
    newshas = set()  # Empty set
    for root, dirs, files in os.walk(startfrom):
        # You can skip some directories by removing them from dirs if present
        print(root, len(files), "Files", end="\r")  # So we can see some progress
        for fname in files:
            sha = hashfile(os.path.join(root, fname))
            if sha not in last_shas:
                print("New/Changed file:", os.path.join(root, fname))  # or some other action
            newshas.add(sha)  # Record every hash seen, new or not
    return newshas

def main():
    """ Main Processing """
    started = datetime.datetime.now()
    sha_list = set()  # Start with none
    if os.path.exists(SHA_FILE):
        with open(SHA_FILE, 'rb') as infile:
            sha_list = pickle.load(infile)
    new_shas = check_tree(TREE_ROOT, sha_list)
    with open(SHA_FILE, 'wb') as outfile:  # Must be opened for writing
        pickle.dump(new_shas, outfile, 4)
    print(f"\n\nCalculated {len(new_shas)} in {datetime.datetime.now() - started}")
    print(f"{len(new_shas.difference(sha_list))} New/Changed files")

if __name__ == "__main__":
    main()
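The script above only prints the new or changed files. Since the goal is a package of those files, the zipfile idea could be sketched roughly as follows. The names `sha256_file` and `bundle_changed`, their parameters, and the choice to store each file under its path relative to the scanned root are assumptions for illustration, not part of the code above. Keeping the relative path avoids name clashes when files with the same name live in different folders.

```python
import hashlib
import os
import zipfile

def sha256_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never have to fit in memory."""
    sha = hashlib.sha256()
    with open(path, "rb") as infile:
        while chunk := infile.read(chunk_size):  # needs Python 3.8+
            sha.update(chunk)
    return sha.digest()

def bundle_changed(tree_root, known_shas, zip_path):
    """Zip every file under tree_root whose hash is not in known_shas.

    Returns the set of all hashes seen, to be saved for the next run.
    zip_path is assumed to live outside tree_root.
    """
    seen = set()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(tree_root):
            for fname in files:
                full = os.path.join(root, fname)
                sha = sha256_file(full)
                # Skip content that is already known or already archived
                if sha not in known_shas and sha not in seen:
                    archive.write(full, os.path.relpath(full, tree_root))
                seen.add(sha)
    return seen
```

A call such as `bundle_changed("/top/of/tree", sha_list, "/some/other/path/changed.zip")` would then stand in for `check_tree`, with the returned set pickled for the next run as before. The zip path here is only an example.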
You could then run this as a cron job or as a scheduled task on your platform.
https://softwarerecs.stackexchange.com/questions/72546