Several optimization points bear mentioning. The performance impact of each depends upon the nature of your file collection, but taken together these optimizations can compound to do thousands of times less work, and adding no benefit at all is rare in my experience.

Zeroth, you may want to be careful with zero-length files: they are all identical, but they also take up little to no space beyond their pathname. They may or may not be of interest in your specific use case, but they are common enough to warrant mentioning. Another zeroth-order concern is file identity: i-node/file numbers are a better notion of identity than path names, since hard links make one file reachable by several paths, and a separate "map out hard links" utility is better suited to reporting those.

First, and most importantly, the file size itself acts as a weak hash that is already maintained by the OS/filesystem. You do not need to hash or compare files of different sizes, since a different size already tells you the content differs. For many file collections the size alone is a strong enough hash, which makes the running time very similar to that of "find". Your files could all be identical in size, in which case this does not help, but it is a very cheap preprocessing step, often barely more expensive than computing the set of files of interest, and it yields clusters of possible duplicates which can then be tested more precisely. On my peculiar file sets I got a speed-up of 10x..100x from this step alone.

Second, if you have a fast IO system (e.g., your data all fits in RAM) then the time to strongly hash likely dominates the calculation, but that also parallelizes easily. In my situations I found parallel speed-up to be better when hashing one flat worklist of every file in any maybe-duplicated cluster rather than hashing cluster by cluster, though my clusters were mostly small compared to my CPU core count; larger clusters, or more need for as-you-go progress reporting, could alter the desirability of this aspect of hashing parallelization. The speed-up of this step depends on how many CPU cores you have and how fast hashing is relative to IO. For well-cached-in-RAM file sets it can be on the order of a factor of 10.

Third, in theory even strong crypto hashes have a small but non-zero chance of a collision, so you may need to deal with the possibility of false positives anyway. If you have a way to deal with a small false positive rate then you may not need to strongly hash the WHOLE file. If some well-defined subrange of files (the first or last 64 KiB, say) is effective at distinguishing files already known to be the same size, then limiting hashing to that range saves a factor of average-file-size/limit-size in both IO and hashing time - so a similar factor whichever is your throughput bottleneck. There is some interaction between hash curtailing and parallelization, in that very small limits can make the bottleneck become non-hashing costs like copying data and task switching, but for file sets with large files these are mostly independent optimizations. There can also be interaction between spinning-rust disk seeking and parallelization.
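To make the first idea concrete before the full program, here is a minimal sketch of just the size-clustering prefilter. It ignores hard links and the other refinements the full program below handles, and the names (by_size, clusters) are only illustrative:
--
import os, sys, stat
from collections import defaultdict

by_size = defaultdict(list)                  # size -> candidate paths
for root in sys.argv[1:] or ["."]:
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)          # lstat: do not follow symlinks
            except OSError:
                continue
            if stat.S_ISREG(st.st_mode) and st.st_size > 0:
                by_size[st.st_size].append(path)

# Only files sharing a size can possibly be duplicates; all other files are done.
clusters = [p for p in by_size.values() if len(p) > 1]
print("%d files in %d size clusters need a closer look"
      % (sum(len(c) for c in clusters), len(clusters)))
--
Everything that follows in the full program only ever has to look inside those clusters.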
Putting these all together, you can get a prototype program in Python (not zsh) that looks like this (it should run under either Python 2 or 3):
--
from sys import argv, stderr as err, stdout as out
from os import environ, listdir, stat, lstat ; E = environ.get
from hashlib import *                  # md5, sha1, sha256, sha512, blake2b, etc.
from stat import S_ISDIR, S_ISREG
from multiprocessing import Pool, cpu_count   # $PAR controls max parallelism

# This is the command interface; could be fancier
roots  = argv[1:] if len(argv) > 1 else [ "." ]
DIGEST = eval(E("DIGEST", "sha1"))     # Can be any algo exported by hashlib
MIN    = int(E("MIN"   , "1"))         # Exclude files smaller than this (bytes);
                                       # ..the default of 1 skips zero-length files
MAX    = int(E("MAX"   , "-1"))        # Limit IO/digest work to first MAX bytes
WASTED = E("WASTED", "")               # Set to non-empty to get a wasted-space report

def findR(ino2st_paths, root, keep, follow):   # Recursive body of find()
    prefix = root + "/"
    for entry in listdir(root):
        if entry == "." or entry == "..": continue
        path = prefix + entry
        try:
            st = stat(path) if follow else lstat(path)
        except OSError as e:
            err.write("findR: %s: %s\n" % (path, e.strerror))
            continue
        if S_ISDIR(st.st_mode):                # Dir ==> recurse
            findR(ino2st_paths, path, keep, follow)
        if keep(st.st_mode):                   # Same i-node ==> same file (hard links)
            try            : ino2st_paths[st.st_ino].append(path)
            except KeyError: ino2st_paths[st.st_ino] = [ st, path ]

def find(roots, keep=S_ISREG, follow=False):
    ino2st_paths = {}
    for root in roots:
        findR(ino2st_paths, root, keep, follow)
    return ino2st_paths

def Digest(path):                              # Hash the first MAX bytes (or all if -1)
    try:
        return path, DIGEST(open(path, "rb").read(MAX)).digest()
    except (IOError, OSError):
        err.write("Cannot digest " + path + "\n")
        return path, b""

## THE MAIN EVENT ##
path2size  = {}                                # path -> size, to avoid stat() calls
size2paths = {}                                # size -> paths of *first* hard links
for sp in find(roots).values():                # find() gives { ino: [stat, path1, ..] }
    if sp[0].st_size < MIN: continue
    for p in sp[1:]:
        path2size[p] = sp[0].st_size
    try            : size2paths[sp[0].st_size].append(sp[1])  # Only append 1st hard
    except KeyError: size2paths[sp[0].st_size] = [ sp[1] ]    # ..link, sp[1]; see findR

size_coll = [ ]                                # Build worklist of files w/size collisions.
for paths in size2paths.values():              # Could Digest() here, but we get better
    if len(paths) > 1:                         # ..parallel scale-up by doing the worklist
        size_coll.extend(paths)                # ..all at once if most colls are "few-way".

digest = {}                                    # Build path -> digest
if MAX == 0:                                   # Size-only mode: the size *is* the digest
    for path in size_coll:
        digest[path] = path2size[path]
else:                                          # Hashing can be slow; use a process pool
    pool = Pool(int(E("PAR", cpu_count())))
    for (path, d) in pool.imap_unordered(Digest, size_coll):
        digest[path] = d + str(path2size[path]).encode()  # Append size so equal digests
                                               # ..of different-size files cannot collide
dups = {}                                      # Invert digest[path] to get collisions
for path in size_coll:
    try            : dups[ digest[path] ].append(path)    # Duplicate digest detected
    except KeyError: dups[ digest[path] ] = [ path ]      # First occurrence of digest

ans = []; sz = 0
for paths in dups.values():                    # Finally, print sorted answer.  Also
    if len(paths) > 1:                         # ..sort rows internally by mtime.
        ans.append("\t".join(sorted(paths, key=lambda p: stat(p).st_mtime)))
        sz += (len(paths) - 1) * path2size[paths[0]]
ans.sort()                                     # Sort alphabetically by first hard link
out.write("\n".join(ans)); out.write("\n")
if WASTED:
    out.write("Wasted Space From Duplicates: %d bytes\n" % sz)
--
To re-connect topically with Zsh, you can study the impact of the size-as-weak-hash and curtailed-hash ideas by saving the above program as "dups" and then doing:
    diff =(dups) =(MAX=0 dups)
or
    diff =(dups) =(MAX=65536 dups)
on a file collection, using Zsh's process substitution to compare the outputs. Alternatively, some ambitious Python hater/Zsh lover could try to re-write the above with Zsh associative arrays, xargs -P, temp files, and the like. I'm not sure I would recommend that. A C re-write would likely do better at minimizing the interaction between hash curtailing and parallelization, or at detecting when disk seeks cause thrashing and parallel IO should be disabled.
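Finally, circling back to the third point: if even the tiny residual collision risk of a curtailed (or full) hash is unacceptable, a byte-for-byte confirmation pass over each reported cluster is cheap relative to the hashing already done. Here is a minimal sketch of such a pass; the confirm() helper is mine, not part of the program above, and it simply assumes each input line is one tab-separated cluster as "dups" prints them:
--
import filecmp, sys

def confirm(cluster):
    """Split a list of same-digest paths into groups of byte-identical files."""
    groups = []                                  # list of [path, path, ...] groups
    for path in cluster:
        for g in groups:
            # shallow=False forces a full content comparison, not just a stat() check
            if filecmp.cmp(g[0], path, shallow=False):
                g.append(path)
                break
        else:
            groups.append([path])
    return [g for g in groups if len(g) > 1]

for line in sys.stdin:                           # e.g.:  dups | python confirm.py
    for group in confirm(line.rstrip("\n").split("\t")):
        print("\t".join(group))
--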