Apologies if my 70-line Python script post added to that confusion. I thought it better to include working code rather than merely a list of theoretical optimizations to consider in a pure Zsh implementation. In my experience lately, Python is more widely available than either Zsh or a C compiler, so there is also that, I suppose.

In timings I just did, my Python was only 1.25x slower than 'duff' (in MAX=-1/hash-everything mode). Most of the time goes to the cryptographic hashing, for which the Python just calls fast implementations, so that's not so surprising. Volodymyr's pipeline took about 20X longer and Bart's pure Zsh took 10X longer. jdupes was 20X faster than the Python but missed 87% of my dups. [ I have not tried to track down why... conceivably wonky pathnames making it abort early. That seems the most likely culprit. ] The Unix duff seems better designed, with a -0 option, FWIW.

One other optimization (not easily exhibited without a low-level language) is to not use a cryptographic hash at all, but to use a super fast one and do the equivalent of a "cmp" only on matching hashes (or perhaps a slow hash only after a fast hash). That is the jdupes approach (sketched in the P.S. below). I think going parallel adds more value than that, though. At least in server/laptop/desktop CPUs from around Intel's Nehalem onward, a single core has been unable to saturate DIMM bandwidth; I usually get 2X-8X more DIMM bandwidth going multi-core. So even infinitely fast hashes would not mean parallelism could not speed things up by a big factor for RAM-resident or fast-IO-backed sets of (equally sized) files. At those speeds, Python's cPickle bandwidth would likely lag the hash/cmp work enough that you'd need a C-like implementation to max out performance.

This may *seem* to be drifting further off topic, but it actually does speak to Ray's original question. Hardware has evolved enough over the last 50 years that a tuned implementation from yester-decade had different concerns than a tuned implementation today: RAM is giant compared to (many) file sets, IO is fast compared to RAM, and CPU cores are abundant. Of the 5 solutions discussed (pure Zsh, Volodymyr's, Py, duff, jdupes), only my Python one used parallelism to any good effect. Then there is variation in which optimization applies to which file sets... so a good tool would probably provide all of them. Even though there may be dozens to hundreds of such tools, there may well *still* be room for a new tuned implementation in a fast language that takes all the optimizations I've mentioned into account instead of just a subset, since any single optimization might, depending upon deployment context, make the interesting procedure "several to many" times faster. [ Maybe some Rust person already did it. They seem to have a lot of energy. ;-) ]

I agree the original question from Emanuel Berg probably just came from being unaware of any such tool, though, or was asked out of curiosity. Anyway, enough about optimal approaches. It was probably always too large a topic.

Cheers
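
P.S. Since the fast-hash-then-cmp idea is hard to show in prose, here is a minimal, hypothetical Python sketch of it (not my posted 70-line script). It buckets by size, then by zlib.crc32 standing in for a genuinely fast hash like xxHash, and only byte-compares files whose fast hashes collide. Names like dup_groups/crc_of are just illustrative.

    #!/usr/bin/env python3
    # Hypothetical sketch: size buckets -> fast (non-crypto) hash buckets ->
    # byte-for-byte "cmp" only where the fast hashes collide.
    import os, sys, zlib, filecmp
    from collections import defaultdict

    def crc_of(path, bufsize=1 << 20):
        """Fast, non-cryptographic whole-file checksum (stand-in for xxHash)."""
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(bufsize), b""):
                crc = zlib.crc32(chunk, crc)
        return crc

    def dup_groups(root):
        # 1) Bucket by size -- files of different sizes cannot be duplicates.
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                if os.path.isfile(p) and not os.path.islink(p):
                    by_size[os.path.getsize(p)].append(p)
        # 2) Within a size bucket, bucket again by the fast hash.
        for paths in by_size.values():
            if len(paths) < 2:
                continue
            by_crc = defaultdict(list)
            for p in paths:
                by_crc[crc_of(p)].append(p)
            # 3) CRC collisions happen, so confirm with a real byte compare,
            #    partitioning each collision bucket into true-equality classes.
            for cand in by_crc.values():
                if len(cand) < 2:
                    continue
                classes = []
                for p in cand:
                    for cls in classes:
                        if filecmp.cmp(cls[0], p, shallow=False):
                            cls.append(p)
                            break
                    else:
                        classes.append([p])
                for cls in classes:
                    if len(cls) > 1:
                        yield cls

    if __name__ == "__main__":
        for group in dup_groups(sys.argv[1] if len(sys.argv) > 1 else "."):
            print("\0".join(group))   # NUL-separated, in the spirit of duff -0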
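
And a similarly hypothetical sketch of the parallelism point: hand the candidate paths to a process pool so several cores hash at once. Only small (path, digest) tuples cross the process boundary, but they are pickled on the way, which is where the cPickle cost I mentioned shows up.

    # Hypothetical sketch: hash candidate files on several cores at once.
    # Results are pickled back to the parent process, which is where pickle
    # bandwidth starts to matter once hashing itself is very fast.
    import hashlib
    from multiprocessing import Pool

    def sha_of(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(bufsize), b""):
                h.update(chunk)
        return path, h.hexdigest()

    def parallel_hashes(paths, jobs=None):
        # jobs=None lets multiprocessing pick one worker per CPU core.
        with Pool(processes=jobs) as pool:
            return dict(pool.imap_unordered(sha_of, paths, chunksize=16))

Call parallel_hashes() under an `if __name__ == "__main__":` guard on platforms that spawn rather than fork worker processes.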