From: Charles Blake
Date: Sun, 7 Apr 2019 07:16:59 -0400
Subject: Re: find duplicate files
To: zsh-users@zsh.org

Several optimization points bear mentioning.  The performance impact of
each of them depends upon the nature of your file collection.  Taken
together, these optimizations can compound to do thousands of times less
work.  In my experience it is rare for them to add no benefit at all.
Zeroth, you may want to be careful with zero-length files, which are all
identical but also take up little to no space beyond their pathname.
These may or may not be of interest in your specific use case, but they
are common enough to warrant mentioning.  Another zeroth-order concern is
file identity: inode/file numbers are better here than path names, and a
separate "map out the hard links" utility is the better tool for those.

First, and most importantly, the file size itself acts as a weak hash
that is already maintained by the OS/filesystem.  You do not need to hash
or compare files of different sizes, since a different size already tells
you the content is different.  For many file collections the size alone
is a strong enough hash, which makes the running time very similar to
that of "find".  Your files could all be identical in size, in which case
this does not help, but it is a very cheap preprocessing step, often
barely more expensive than computing the set of files of interest.  It
yields a set of clusters of possible duplicates which can then be tested
more precisely.  In my peculiar file sets I got a speed-up of 10x..100x
from this preprocessing step alone.

Second, if you have a fast IO system (e.g., your data all fits in RAM)
then the time to strongly hash likely dominates the calculation, but that
also parallelizes easily.  In my situations I found the parallel speed-up
to be better when hashing over the full list of files from all
maybe-duplicated clusters rather than hashing cluster by cluster, but my
clusters were mostly small compared to my CPU core count.  Larger
clusters, or more need for as-you-go progress reporting, could alter the
desirability of that aspect of the parallelization.  The speed-up of this
step depends on how many CPU cores you have and on how fast hashing is
relative to IO.  For well-cached-in-RAM file sets it can be on the order
of a factor of 10.

Third, in theory even strong crypto hashes have a small but non-zero
chance of a collision, so you may need to deal with the possibility of
false positives anyway.  If you have a way to deal with a small false
positive rate then you may not need to strongly hash the WHOLE file.  If
some well-defined subrange of the files (the first or last 64 KiB, say)
is effective at distinguishing files already known to be the same size,
then limiting hashing to that range saves a factor of
average-file-size/limit-size (in both IO and hashing time - so a similar
factor whichever is your throughput bottleneck).  There is some
interaction between hash curtailing and parallelization, in that very
small limits can shift the bottleneck to non-hashing costs like copying
data and task switching, but for file sets with large files these are
mostly independent optimizations.  There can also be interaction between
spinning-rust disk seeking and parallelization.
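As a rough shell-level illustration of that curtailed-hashing idea (just
a sketch: md5sum, head -c and the 64 KiB cutoff are arbitrary choices
here, and the array $candidates is assumed to already hold one cluster
of same-size paths):

    typeset -A prefix_digest            # path -> digest of first 64 KiB only
    for f in $candidates; do
        prefix_digest[$f]=$(head -c $((64 * 1024)) < $f | md5sum | cut -d' ' -f1)
    done

Files in a cluster whose prefix digests differ cannot be duplicates; only
those that still collide would need a fuller check.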
Putting these all together, you can get a prototype program in Python
(not zsh) that looks like this (should work on any Python version):
--
from sys import argv, stderr as err, stdout as out, exc_info
from os import environ, listdir, stat, lstat ; E = environ.get
from hashlib import *              # md5, sha1, sha256, sha512, blake2b, etc.
from multiprocessing import Pool, cpu_count  # $PAR controls max parallelism
from stat import S_ISDIR, S_ISREG

#This is the command-interface; could be fancier
roots  = argv[1:] if len(argv) > 1 else [ "." ]
DIGEST = eval(E("DIGEST", "sha1")) #Can be any algo exported by hashlib
MIN    = int(E("MIN" , "1"))       #Exclude files smaller than this (bytes)
MAX    = int(E("MAX" , "-1"))      #Limit IO/digest work to first MAX bytes
WASTED = E("WASTED", "")           #Set to non-empty to get a wasted report

def findR(ino2st_paths, root, keep, follow):
    prefix = root + "/"                       #Recursive body of find()
    for entry in listdir(root):
        if entry == "." or entry == "..": continue
        path = prefix + entry
        try:
            st = stat(path) if follow else lstat(path)
        except OSError:
            err.write("findR: %s: %s\n" % (root, exc_info()[1]))
            continue
        if S_ISDIR(st.st_mode):               #Dir ==> Recurse
            findR(ino2st_paths, path, keep, follow)
        if keep(st.st_mode):
            try   : ino2st_paths[st.st_ino].append( path )
            except: ino2st_paths[st.st_ino] = [ st, path ]

def find(roots, keep=S_ISREG, follow=False):
    ino2st_paths = {}
    for root in roots:
        findR(ino2st_paths, root, keep, follow)
    return ino2st_paths

def Digest(path):
    try:
        return path, DIGEST(open(path, "rb").read(MAX)).digest()
    except:
        err.write("Cannot digest " + path + "\n")
        return path, b""

## THE MAIN EVENT ##
path2size  = {}                    #path -> size to avoid stat() calls
size2paths = {}                    #size -> paths of *first* hardlinks
for sp in find(roots).values():    #find() gives { ino: [stat,path1,..] }
    if sp[0].st_size < MIN: continue
    for p in sp[1:]: path2size[p] = sp[0].st_size
    try   : size2paths[sp[0].st_size].append(sp[1]) #only append 1st hardlink
    except: size2paths[sp[0].st_size] = [ sp[1] ]   #which is sp[1]; see findR

size_coll = [ ]                       #Build worklist of files w/size collides
for paths in size2paths.values():     #Could Digest() here, but get better
    if len(paths) > 1:                #..parallel scale-up to do worklist all
        size_coll.extend(paths)       #..at once if most colls are "few-way".

digest = {}                           #Build path -> digest
if MAX == 0:                          #Size alone stands in for the digest
    for path in size_coll:
        digest[path] = path2size[path]
else:                                 #Hashing can be slow; use a proc pool
    pool = Pool(int(E("PAR", cpu_count())))
    for (path, d) in pool.imap_unordered(Digest, size_coll):
        digest[path] = d + str(path2size[path]).encode()

paths = {}                            #Invert digest[path] to get collisions
for path in size_coll:
    try   : paths[ digest[path] ].append(path)  #Duplicate digest detected
    except: paths[ digest[path] ] = [ path ]    #First occurrence of digest

ans = []; sz = 0
for plist in paths.values():          #Finally, print sorted answer.  Also
    if len(plist) > 1:                #..sort rows internally by mtime.
        ans.append("\t".join(sorted(plist, key=lambda p: stat(p).st_mtime)))
        sz += (len(plist) - 1) * path2size[plist[0]]
ans.sort()                            #sort alphabetically by first hardlink
out.write("\n".join(ans)); out.write("\n")
if WASTED:
    out.write("Wasted Space From Duplicates: %d bytes\n" % sz)
--

To re-connect topically with zsh, you can study the impact of the file
size "weak hash prefix" idea by saving the above program as "dups" and
then, using zsh's process substitution, doing

    diff =(dups) =(MAX=0 dups)
or
    diff =(dups) =(MAX=65536 dups)

on a file collection.

Alternatively, some ambitious Python hater/zsh lover could try to
re-write the above with zsh associative arrays, xargs -P, temp files and
the like.  I'm not sure I would recommend that.  A C re-write would
likely do better at minimizing the hash-curtailing and parallelization
interactions, or at detecting when disk seeks cause thrashing and you
want to disable parallel IO.
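For the record, a very rough zsh-flavoured sketch of that direction
(associative arrays for the size prefilter, GNU xargs -P for parallel
digesting) might start out like this.  It is only a sketch: it assumes
GNU xargs and sha1sum, takes a single optional root directory argument,
and glosses over hard-link handling, hash curtailing, odd filenames and
xargs -P output interleaving, all of which a real version would need to
address:
--
#!/usr/bin/env zsh
# Sketch only: size prefilter via an associative array, then parallel
# sha1sum via GNU xargs -P, then group paths by digest.
zmodload -F zsh/stat b:zstat
root=${1:-.}
par=${PAR:-4}                          # $PAR caps parallel digest processes

typeset -A bysize                      # size -> newline-joined paths
for f in $root/**/*(.DN); do           # regular files, dotfiles included
    sz=$(zstat +size $f) || continue
    bysize[$sz]+="$f"$'\n'
done

candidates=()                          # keep only files whose size collides
for sz in ${(k)bysize}; do
    members=(${(f)bysize[$sz]})
    (( $#members > 1 )) && candidates+=($members)
done

typeset -A byhash                      # digest -> newline-joined paths
if (( $#candidates )); then
    print -rN -- $candidates |
        xargs -0 -P $par -n 16 sha1sum |
        while IFS= read -r line; do    # lines look like "digest  path"
            byhash[${line%% *}]+="${line#*  }"$'\n'
        done                           # last pipe stage runs in this shell
fi

for h in ${(k)byhash}; do
    dups=(${(f)byhash[$h]})
    (( $#dups > 1 )) && print -l -- "== possible duplicates ==" $dups
done
--
The interesting question would be whether the associative-array
bookkeeping stays tolerable in zsh once you add the curtailing, hard-link
and wasted-space reporting that the Python version already has.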