>I find that a LOT more understandable than the python code.

Understandability is, of course, somewhat subjective (e.g. some might say "every 15th field" is unclear relative to a named label), but the Python also does more and would be quite a bit simpler without any parallelism, while the Zsh code would be more complex with a setup for it. So it's a somewhat apples-and-oranges (or strawman) comparison. No offense intended. I don't dislike your solution or a simpler Python one. Simplicity is good. As elaborated below, everything sort of "just depends": some hash functions are really fast, some IO systems are really slow, etc.

To really do the parallel IO/hashing right (i.e. be nice to competing work, both from yourself and from other programs on the system) you probably want to start with a parallelism of 1 CPU and dynamically adjust based on CPU utilization (say, after each file, via getrusage in some coordinator thread). That would be more complex still, and, as we both mentioned, it would be more efficient in a lower-level language.

>This should be vanishingly small if the sizes are the same?

Of course, if you have a hash with no collision attack, or no reason to suspect a filesystem collision attack, then yes, a vanishingly small probability. I did not mean to overemphasize the risk so much as use it to introduce the "truncated IO/hash" optimization idea, which trades some false positive risk for performance. I personally used the report for rare-case manual intervention, not automated response, so cmp at the end makes total sense. I don't think there is any disagreement here.

Another, perhaps underappreciated, motivation for the truncated IO/hash optimization is that if your report is empty then you are done: false negatives don't happen, since true duplicates always share a common prefix. That might well be the common case after some situation is cleaned up and you want a "quick check" that there are no duplicates, or if you otherwise "suspect but do not know" there are no duplicates in some file set. If that quick check fails, you can then do a slower, more precise job. With large files or slow IO this can help a lot. Call it an "is it possible there's a duplicate" operation if you will.

Partial hashes can also be a jumping-off point for detecting "corrupt" or near-duplicate files. The many ways corruption can happen generate a list of questions/approaches which is off topic, but one way is non-size-changing scribbling over (or non-population of) just a few bytes, which does relate to partial hashing, though obviously partial hashing is not perfectly tuned for that application.

>I haven't dealt with the files having strange characters in the names
>that might prevent them being easily used as hash keys, that's left as
>an exercise

Part of my purpose in writing the Python was indeed being able to deal with weird path names, which I also generally try hard to avoid, but sometimes that is beyond one's control/authority. I almost added a user parameter for the hard TAB and NEWLINE characters used as output delimiters, but tried to keep it simple for exposition. With UTF-8 these days, truly general filenames are decidedly a PITA.

>unless you're NOT going to consider linked files as duplicates you
>might as well just compare sizes. (It would be faster to get inodes

What may have been underappreciated is that handling hard-link identity also lets you skip recomputing hashes over a hard-link cluster, which must all be identical (or redoing the IO, if the time between revisiting that inode is long enough to lose buffer caching). One thinkable, possibly even automatic, resolution of duplicates is conversion into hard links. If a lot of duplication were resolved that way, this would save time on a "second pass" over such file hierarchies. I'm just defending that admittedly likely minor optimization (which just wasn't so hard to add in Python).
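To make the inode point concrete, here is a minimal sketch of the "hash each inode only once" idea. It is separate from the script I posted; the function name, hash choice, and chunk size are just placeholders:

import hashlib
import os

def hash_with_hardlink_sharing(paths, algo="sha256", chunk=1 << 20):
    """Hash every path, but read and hash each distinct (st_dev, st_ino)
    only once; all hard links to that inode reuse the cached digest."""
    by_inode = {}   # (st_dev, st_ino) -> hex digest
    digests = {}    # path -> hex digest
    for p in paths:
        st = os.stat(p)
        key = (st.st_dev, st.st_ino)
        if key not in by_inode:
            h = hashlib.new(algo)
            with open(p, "rb") as f:
                for block in iter(lambda: f.read(chunk), b""):
                    h.update(block)
            by_inode[key] = h.hexdigest()
        digests[p] = by_inode[key]
    return digests

Since two paths with the same (st_dev, st_ino) are the same underlying file, skipping the second read costs nothing in accuracy; it only saves IO and CPU.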
Most claims in this whole space depend upon application context in some way. Some cases are hard, like files being updated during the duplicate scan: yes, you can look at time stamps and maybe recompute things, but ick; if things are changing, who knows what version the user wants assessed? Duplicates are probably rare. Skipping work across large trees of heavily hard-linked files is probably a much smaller optimization than using file size as a weak hash to organize the calculation. You can probably find a hash function faster than your IO; or maybe you have files on some quad M.2 NVMe RAID setup delivering 13 gigabytes/sec, and it is harder to match that with a crypto hash function. Almost everything you say needs a "probably/maybe" qualifier. I don't think you disagree. I'm just elaborating a little for passers-by.

P.S.: Two of my side comments got auto-wrapped at 76 text columns. Should be an easy fix for anyone cutting & pasting that Python code. Sorry!
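P.P.S.: Since the "size as a weak hash, then truncated IO/hash" organization came up a few times above, here is a minimal, self-contained sketch of that quick check. Again, this is not the script I posted; the prefix size and the names are just placeholders:

import hashlib
import os
from collections import defaultdict

def possible_duplicates(paths, prefix_bytes=1 << 16):
    """Cheap 'is a duplicate even possible?' pass: bucket by size, then by a
    hash of only the first prefix_bytes of each file.  An empty result means
    no duplicates (no false negatives); a non-empty result is only a candidate
    set that still needs a full hash or cmp to confirm."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.stat(p).st_size].append(p)
    suspects = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue                      # a unique size cannot be duplicated
        for p in group:
            with open(p, "rb") as f:
                prefix_digest = hashlib.sha256(f.read(prefix_bytes)).hexdigest()
            suspects[(size, prefix_digest)].append(p)
    return {key: grp for key, grp in suspects.items() if len(grp) > 1}

If this returns an empty dict you are done; anything it does return is only a list of suspects to feed into the slower, more precise pass.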