High-performance search tool to find binary file matches across directories of thousands of files on OSX

Stack Overflow user
Asked 2017-04-19 22:55:18
1 answer · 132 views · 0 followers · 1 vote

I am merging two large (1000s) sets of photos into a different directory structure, and many photos already exist in both sets. I am going to write a script that does this:

For a given photo in set B,
Check if a binary match for it exists in set A.
If there's a match, delete the file.

After checking all the files in set B, I will merge the (now unique) remainder of B into set A.

Binary matches may exist under different filenames, so the filename should be ignored when testing.

Also, I will be running a set A search for every single file in set B, so I would prefer a tool that builds an index of set A in an initial scan. Luckily, this index can be built once and never needs updating.
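The one-time index described above could be sketched as a mapping from content hash to file path, so that every lookup from set B is a constant-time dictionary probe. This is my own minimal illustration of the idea, not code from the question; the function name `build_index` and the choice of MD5 are assumptions:

```python
# Minimal sketch (illustration only): index every file under a root directory
# by its MD5 digest, so that later membership checks ignore filenames entirely.
import hashlib
import os

def build_index(root):
    index = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            index[digest] = path  # keyed by content only; the filename is ignored
    return index
```

Reading each file whole keeps the sketch short; for photo collections with very large files, the chunked hashing shown in the accepted answer below is the safer pattern.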

I was going to script this on OSX, but Python would be fine too.


1 Answer

Stack Overflow user

Accepted answer

Answered 2017-04-21 15:12:05

Based on Mark's suggestion, I wrote a pair of Python scripts to solve my problem.

md5index.py:

#given a folder path, makes a hash index of every file, recursively
import sys, os, hashlib, io

#some files need to be hashed incrementally as they may be too big to fit in memory
#http://stackoverflow.com/a/40961519/2518451
def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
    return md5

#this project was done on macOS. There may be other files that are appropriate to ignore on other platforms.
ignore_files = [".DS_Store"]

def index(source, index_output):

    index_output_f = open(index_output, "wt")
    index_count = 0

    for root, dirs, filenames in os.walk(source):

        for f in filenames:
            if f in ignore_files:
                continue

            fullpath = os.path.join(root, f)

            md5 = md5sum(fullpath)
            md5string = md5.hexdigest()
            line = md5string + ":" + fullpath
            index_output_f.write(line + "\n")
            print(line)
            index_count += 1

    index_output_f.close()
    print("Index Count: " + str(index_count))


if __name__ == "__main__":
    index_output = "index_output.txt"

    if len(sys.argv) < 2:
        print("Usage: md5index [path]")
    else:
        index_path = sys.argv[1]
        print("Indexing... " + index_path)
        index(index_path, index_output)
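The chunked-read idiom inside `md5sum` gives the same digest as hashing the whole file in one pass, just without holding the file in memory. A standalone restatement of that idiom, as a sanity check of my own (not part of the original answer):

```python
# Standalone sketch of the chunked hashing idiom used in md5sum above:
# reading the file in DEFAULT_BUFFER_SIZE pieces yields the same MD5
# digest as a single full-file read would.
import hashlib
import io

def chunked_md5(path, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(path, mode="rb") as fd:
        # iter() with a sentinel keeps calling fd.read(length) until it
        # returns b"" at end of file
        for chunk in iter(lambda: fd.read(length), b""):
            md5.update(chunk)
    return md5.hexdigest()
```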

And uniquemerge.py:

#given an index_output.txt in the same directory and an input path,
#remove all files that already have a hash in index_output.txt

import sys, os
from md5index import md5sum
from send2trash import send2trash
SENDING_TO_TRASH = True

def load_index():
    index_output = "index_output.txt"
    index = set()  #set membership tests are O(1), unlike a list
    with open(index_output, "rt") as index_output_f:
        for line in index_output_f:
            #split on the first ':' only, in case the path contains one
            md5 = line.split(':', 1)[0]
            index.add(md5)
    return index

#traverse files under merge_path, compare each against the index
def traverse_merge_path(merge_path, index):
    found = 0
    not_found = 0

    for root, dirs, filenames in os.walk(merge_path):
        for f in filenames:
            fullpath = os.path.join(root, f)

            md5 = md5sum(fullpath)
            md5string = md5.hexdigest()

            if md5string in index:
                if SENDING_TO_TRASH:
                    send2trash(fullpath)

                found += 1
            else:
                print("\t NON-DUPLICATE ORIGINAL: " + fullpath)
                not_found += 1


    print("Found Duplicates: " + str(found) + " Originals: " + str(not_found))


if __name__ == "__main__":
    index = load_index()
    print("Loaded index with item count: " + str(len(index)))

    print("SENDING_TO_TRASH: " + str(SENDING_TO_TRASH))

    merge_path = sys.argv[1]
    print("Merging To: " + merge_path)

    traverse_merge_path(merge_path, index)
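One caveat with the scripts above: an MD5 match is, strictly speaking, only a very strong hint that two files are identical. Before trashing anything irreplaceable, a byte-for-byte confirmation against the matching file in set A would make the deletion safer. This is my own addition, not part of the original scripts; it assumes you also keep the path stored after the `:` in index_output.txt so the candidate can be compared against it:

```python
# Safety-net sketch (my addition, not in the original answer): confirm a hash
# match with a full byte-for-byte comparison before deleting the candidate.
import filecmp

def is_true_duplicate(candidate_path, indexed_path):
    # shallow=False forces filecmp to compare file contents rather than
    # just os.stat() metadata (size and modification time)
    return filecmp.cmp(candidate_path, indexed_path, shallow=False)
```

In `traverse_merge_path`, this check would sit between the `md5string in index` test and the `send2trash` call.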

Say I want to merge folderB into folderA. I would do:

python md5index.py folderA
# creates index_output.txt with hashes of everything in folderA

python uniquemerge.py folderB
# deletes all files in folderB that already existed in folderA
# I can now manually merge folderB into folderA
Votes: 1
Original page content provided by Stack Overflow.
Original link:
https://stackoverflow.com/questions/43507418
