
Q: File comparison tool (checksum-based, Windows 2016/10)

Software Recommendation user

Asked 2020-03-17 20:03:43

1 answer, 80 views, 0 followers, 1 vote

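So I'm looking for a tool that can compare the files in folders based on checksums (which is common and not hard to find); however, my use case is that these files can live in fairly deep folder paths that can change, and I want to run the comparison every few months and create a package containing only the files that differ. I don't care which folder a file is in; the same file may regularly move between folders, and files never change name, only content (hence the need for checksums).

My problem is that almost every tool I can find cares about the folder path when comparing folders. I don't want that; in fact, I want it to ignore folder paths. I'd rather not develop anything, or at most develop only a small part of the process, to save time.

To be clear, the sequence I want is: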

Program scans directory from 1/1/2020 (A).
Program scans directory from 4/1/2020 (B).
Program finds all checksums in B that don't exist in A and makes a new folder containing the files that are different (C).

Any ideas? Also, this only needs to happen once every 4 months and only covers about 47 to (32,000 files). If it runs for 18 hours, that's perfectly fine. I just need it to work.

file-management

server

windows-server

1 answer

Software Recommendation user

Answered 2020-03-18 09:30:15

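I would suggest using a hash that is more sophisticated than a simple checksum, to avoid the possibility of hash collisions. Possible hashes are SHA-1, SHA-256, etc.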
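To make the distinction concrete, here is a minimal sketch using only the Python standard library. It contrasts a 32-bit CRC checksum (`zlib.crc32`) with a 256-bit SHA-256 digest (`hashlib.sha256`). The helper names and the sample data are made up for illustration.

```python
import hashlib
import zlib

def crc32_of(data: bytes) -> int:
    """Simple checksum: a 32-bit value, so only 2**32 distinct results exist."""
    return zlib.crc32(data)

def sha256_of(data: bytes) -> str:
    """Cryptographic hash: a 256-bit digest, returned as 64 hex characters."""
    return hashlib.sha256(data).hexdigest()

sample = b"example file contents"     # hypothetical file content
changed = b"example file contents!"   # the same content with one byte appended

print(f"CRC32   : {crc32_of(sample):08x}")
print(f"SHA-256 : {sha256_of(sample)}")
print("SHA-256 differs after the edit:", sha256_of(sample) != sha256_of(changed))
```

With tens of thousands of files an accidental CRC32 collision is already plausible, whereas an accidental SHA-256 collision is not a practical concern.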

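You can do this on almost any platform with a few lines of Python, using the built-in libraries: specifically os.walk to traverse your directory structure and hashlib to calculate the hashes. You could even use zipfile to create a compressed archive of the new/changed files. Personally, I would go about it as follows: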

#!python
# The following code assumes Python 3.8 or higher
import os
import datetime
import pickle # used to store the dictionary between runs
import hashlib

CHUNK_SIZE = 1024*1024  # A megabyte at a time; adjust if necessary
TREE_ROOT = "/top/of/tree" # Wherever that is
SHA_FILE = "/some/other/path/tree_shas.pickle" # Adjust as needed

def hashfile(filepath):
    """ Calculate the hash of a single file """
    with open(filepath, 'rb') as infile:
        sha = hashlib.sha256()
        while chunk := infile.read(CHUNK_SIZE): # The := operator requires Python 3.8 or higher
            sha.update(chunk)
    return sha.digest()

# The above tested with a 12 MB file and took 39 msecs on my laptop

def check_tree(startfrom, last_shas):
    """ Check the contents of a tree against the sha values in the last_shas set """
    newshas = set() # Empty set
    for root, dirs, files in os.walk(startfrom):
        # You can skip some directories by removing them from dirs if present
        print(root, len(files), "Files", end="\r") # So we can see some progress
        for fname in files:
            sha = hashfile(os.path.join(root, fname))
            if sha not in last_shas:
                print("New/Changed file:", os.path.join(root, fname)) # or some other action
            newshas.add(sha)
    return newshas

def main():
    """ Main Processing """
    started = datetime.datetime.now()
    sha_list = set() # Start with none
    if os.path.exists(SHA_FILE):
        with open(SHA_FILE, 'rb') as infile:
            sha_list = pickle.load(infile)
    new_shas = check_tree(TREE_ROOT, sha_list)
    with open(SHA_FILE, 'wb') as outfile: # Must be opened for writing, not reading
        pickle.dump(new_shas, outfile, 4)
    print(f"\n\nCalculated {len(new_shas)} in {datetime.datetime.now() - started}")
    print(f"{len(new_shas.difference(sha_list))} New/Changed files")

if __name__ == "__main__":
    main()
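
The zipfile idea mentioned above could look roughly like the sketch below. It is only an illustration: `zip_changed` is a hypothetical helper, and the renaming scheme for duplicate file names (prefixing the position in the list) is an assumption, chosen because folder paths are being ignored and two files from different folders may share a name.

```python
import os
import zipfile

def zip_changed(paths, archive_path):
    """Write the given files into a new ZIP archive, dropping their folder paths."""
    used = set()
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for n, path in enumerate(paths):
            name = os.path.basename(path)
            if name in used:
                # Same file name seen earlier from another folder:
                # prefix the list position (hypothetical naming scheme)
                name = f"{n}_{name}"
            used.add(name)
            zf.write(path, arcname=name)
    return sorted(used)
```

To use it, check_tree would need to collect the paths of new/changed files in a list instead of only printing them, and that list would then be passed to `zip_changed` together with the archive location.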

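Free, open source and gratis
Runs on almost any platform
You may not have room to store all of the SHA values in RAM, so it may need to be a little more sophisticated.
Rather than printing the changed file names, the script can perform whatever other action you need.
Well suited to a scheduled task on platforms that support cron jobs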
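
As one possible "other action", the sketch below copies every file whose hash was not seen in the previous run into a separate folder, which is the folder C described in the question. It is an assumption-laden illustration rather than part of the script above: `sha256_file` and `copy_new_files` are hypothetical names, the destination folder is assumed to lie outside the scanned tree, and prefixing each copied file with part of its hash is just one way to keep same-named files from overwriting each other.

```python
import hashlib
import os
import shutil

def sha256_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read one chunk at a time."""
    sha = hashlib.sha256()
    with open(path, "rb") as infile:
        for chunk in iter(lambda: infile.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()

def copy_new_files(tree_b, known_hashes, dest_c):
    """Copy files under tree_b whose hash is not in known_hashes into dest_c.

    dest_c is assumed to be outside tree_b, otherwise the copies would be rescanned.
    """
    os.makedirs(dest_c, exist_ok=True)
    seen = set(known_hashes)   # hashes from the earlier scan (A)
    copied = []
    for root, _dirs, files in os.walk(tree_b):
        for fname in files:
            src = os.path.join(root, fname)
            digest = sha256_file(src)
            if digest in seen:
                continue       # unchanged or moved file, or a duplicate already copied
            seen.add(digest)
            # Prefix part of the hash so equal names from different folders do not clash
            target = os.path.join(dest_c, f"{digest[:12]}_{fname}")
            shutil.copy2(src, target)
            copied.append(target)
    return copied
```

A file that only moved to another folder keeps its hash and is skipped, while a file with the same name but new content is copied.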

Votes: 1

Original page content provided by Software Recommendation. Translation supported by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://softwarerecs.stackexchange.com/questions/72546
