I'm currently trying to pick up F#. This is my first time using a .NET language, so I'm very new to the APIs available.
As a beginner project, I want to implement my own duplicate file finder. It was suggested that I use checksums, because the files I'm comparing are fairly large (mostly between 1 MB and 10 MB).
Here's what I've done so far: after checking file lengths, I compare files with the same length by reading all of their bytes into byte arrays. Now I want to compute an MD5 hash of each byte array and then remove the duplicate files that share the same hash.
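For a single file, the hashing step I have in mind is roughly this (just a sketch of the idea; the function name is my own):

open System.IO
open System.Security.Cryptography

// Sketch: read one file's bytes and compute its MD5 hash.
let md5OfFile (path : string) =
    use md5 = MD5.Create()
    md5.ComputeHash(File.ReadAllBytes path)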
I have a few questions:
Thanks for your help. I may post follow-up questions in response to your answers.
Edit:
let readAllBytesMD5 (tupleOfFileLengthsAndFiles) =
    let md5 = MD5.Create()
    tupleOfFileLengthsAndFiles
    |> snd
    |> Seq.map (fun eachFile -> (File.ReadAllBytes eachFile, eachFile))
    |> Seq.groupBy fst
    |> Seq.map (fun (byteArray, eachFile) -> (md5.ComputeHash(byteArray), eachFile))
I want to extract the keys (the hash byte arrays) that have more than one value (the corresponding files) and then remove the duplicate files. How can I improve on and continue from the code sample above? I'm not familiar with how MD5 works, so I'm stuck here. Any suggestions would be appreciated.
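Very roughly, I imagine the next step keeping only the hash groups that contain more than one file, though I'm not sure whether this is correct or idiomatic:

// Rough sketch of the intent: keep only the (hash, files) groups with more than one entry.
// The name keepDuplicatesOnly is mine; it isn't part of the pipeline above.
let keepDuplicatesOnly groups =
    groups
    |> Seq.filter (fun (_, files) -> Seq.length files > 1)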
Posted on 2015-03-08 05:34:01
There's nothing particularly wrong with using MD5 for this task. However, MD5 is no longer considered a "strong" hash; it is relatively easy for a determined party to create files with different contents but the same MD5 hash.
For something more robust, I'd suggest using one of the SHA-2 hashes, such as SHA256.
One performance caveat, though: hashing will only improve your tool's performance if you cache the files' hashes (and update them incrementally as files are added, removed, or modified). If you don't cache the hashes, you'll need to read the full contents of both files and compute their hashes every time you find a potential collision; if this tool is only meant for ad-hoc deduplication, it may be faster or simpler to just compare the contents of same-sized files directly whenever you find them.
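For illustration, a direct comparison of two same-sized files might look something like the following sketch (the function name is illustrative, and a chunked read would be faster than this byte-by-byte version):

open System.IO

/// Sketch: compare two files byte-by-byte, stopping at the first difference.
/// Assumes the caller has already checked that the file lengths match.
let filesHaveSameContent (pathA : string) (pathB : string) =
    use streamA = File.OpenRead pathA
    use streamB = File.OpenRead pathB
    let rec loop () =
        let a = streamA.ReadByte ()
        let b = streamB.ReadByte ()
        if a <> b then false        // Mismatch found.
        elif a = -1 then true       // Both streams reached end-of-file together.
        else loop ()
    loop ()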
Edit: Here is some sample code you can use. It will detect duplicates, but you'll need to write another function to decide how to resolve the conflicts (for example, you might want to keep whichever file was created first).
open System.IO
open System.Security.Cryptography
/// Given a sequence of filenames, looks for duplicate files by comparing file lengths
/// and, if necessary, hash values calculated using the specified hash algorithm.
/// Returns a sequence of tuples; the first item in the tuple is a hash value and the
/// second item is a sequence containing the names of two or more files which have
/// the same length and hash value.
let findDuplicateFiles (algorithm : HashAlgorithm) (filenames : seq<string>) =
    filenames
    |> Seq.groupBy (fun filename ->
        (FileInfo filename).Length)
    |> Seq.collect (fun (_, sameLengthFilenames) ->
        // If there's only one file with this length, there's no duplication so don't return it.
        if Seq.length sameLengthFilenames = 1 then Seq.empty
        else
            // Possible duplication. Resolve by hashing the files and comparing the hashes.
            sameLengthFilenames
            |> Seq.groupBy (fun filename ->
                using (File.OpenRead filename) algorithm.ComputeHash)
            // Check for multiple files with the same hash value.
            // Return any such filenames so outside code can determine how to handle them.
            |> Seq.filter (fun (_, sameHashFilenames) ->
                // Collision when two or more files have the same hash.
                Seq.length sameHashFilenames > 1))
/// Given a sequence of filenames, looks for duplicate files by comparing file lengths
/// and, if necessary, hash values calculated using the SHA256 algorithm.
/// Returns a sequence of tuples; the first item in the tuple is a hash value and the
/// second item is a sequence containing the names of two or more files which have
/// the same length and hash value.
let findDuplicateFilesSHA256 filenames =
    // NOTE: The algorithm should be bound with 'use' or 'using' here so it can be disposed,
    // but the F# 3.1 compiler appears to dispose the object too early.
    findDuplicateFiles (SHA256.Create()) filenames
//
let printDuplicateEntry (hash : byte[], filenames : seq<string>) =
    stdout.WriteLine ""
    stdout.Write "Hash: "
    stdout.WriteLine (System.BitConverter.ToString(hash).Replace("-", ""))
    for filename in filenames do
        printfn " %s (Length: %i)" filename ((FileInfo filename).Length)
//
let findDuplicateFilesInDirectory path =
    Directory.EnumerateFiles (path)
    |> findDuplicateFilesSHA256
    |> Seq.iter printDuplicateEntry
;;
// Example usage:
findDuplicateFilesInDirectory @"C:\Users\Jack\Desktop";;
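As a starting point for the conflict-resolution function mentioned above, here is a sketch that keeps the earliest-created file in each duplicate group and returns the rest; the name filesToRemove is illustrative and not part of the code above:

open System.IO

/// Sketch: given one duplicate group, keep the file with the earliest creation time
/// and return the remaining filenames as candidates for deletion.
let filesToRemove (hash : byte[], filenames : seq<string>) =
    filenames
    |> Seq.sortBy (fun filename -> (FileInfo filename).CreationTimeUtc)
    |> Seq.skip 1

// Example: list every redundant copy under a directory.
// Directory.EnumerateFiles path |> findDuplicateFilesSHA256 |> Seq.collect filesToRemove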
https://stackoverflow.com/questions/28926709