From: Charles Blake <charlechaud@gmail.com>
Date: Mon, 8 Apr 2019 07:17:56 -0400
Subject: Re: find duplicate files
To: Bart Schaefer
Cc: Zsh Users

>I find that a LOT more understandable than the python code.

Understandability is, of course, somewhat subjective (e.g., some might say
"every 15th field" is unclear relative to a named label), but the Python
also does more, and it would be quite a bit simpler without any
parallelism, while the Zsh code would be more complex with a setup for it.
So it's a somewhat apples-and-oranges/strawman comparison. No offense
intended; I don't dislike your solution or a simpler Python one, and
simplicity is good. As elaborated below, everything sort of "just depends":
some hash functions are really fast, some IO systems are really slow, etc.
To really do the parallel IO/hashing right (i.e., be nice to competition
both from yourself and from other programs on the system) you probably want
to start with a parallelism of one CPU and dynamically adjust it based on
CPU utilization, say after each file, via getrusage in some parallel thread
coordinator. That would be more complex still, but, as we both mentioned,
it would be more efficient in a lower-level language.

>This should be vanishingly small if the sizes are the same?

Of course, if you have a hash with no collision attack, or no reason to
suspect a filesystem collision attack, then yes, a vanishingly small
probability. I did not mean to overemphasize the risk so much as to use it
to introduce the "truncated IO/hash" optimization idea, which trades some
false-positive risk for performance. I personally used the report for
rare-case manual intervention, not automated response, so cmp at the end
makes total sense. I don't think there is any disagreement here.

Another, possibly underappreciated, motivation for the truncated IO/hash
optimization is that if your report is empty then you are done: false
negatives don't happen. That might well be the common case after some
situation has been cleaned up and you want a "quick check" that there are
no duplicates, or if you otherwise "suspect but do not know" that some file
set has none. If that quick check fails, you can do a slower, more precise
job. With large files or slow IO this can help a lot. Call it an "is it
possible there's a duplicate?" operation if you will.

Partial hashes can also be a jumping-off point for detecting "corrupt" or
near-duplicate files. The many ways corruption can happen generate a list
of questions/approaches that is off topic here. One way it can happen is
non-size-changing scribbling over, or non-population of, just a few bytes,
which does relate to partial hashing, though obviously partial hashing is
not perfectly tuned for such an application.

>I haven't dealt with the files having strange characters in the names
>that might prevent them being easily used as hash keys, that's left as
>an exercise

Part of my purpose in writing the Python was indeed to be able to deal with
weird path names, which I also generally try hard to avoid, but sometimes
that's beyond one's control/authority. I almost added a user parameter for
the hard TAB and NEWLINE characters used as output delimiters, but tried to
keep it simple for exposition. With UTF-8 these days, truly general
filenames are decidedly a PITA.

>unless you're NOT going to consider linked files as duplicates you
>might as well just compare sizes. (It would be faster to get inodes

What may be underappreciated is that handling hard-link identity also lets
you skip recomputing hashes over a hard-link cluster, which must all be the
same (or re-doing the IO, if the time between hitting that i-node again is
long enough to lose buffer caching). One thinkable, possibly even
automatic, resolution of duplicates is conversion into hard links. If a lot
of duplication were resolved this way, this would save time on a "second
pass" over such file hierarchies. I'm just defending that admittedly likely
minor optimization (which just wasn't so hard to add in Python). Most
claims in this whole space depend upon application context in some way.
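For passers-by, here is a bare-bones serial sketch (not the parallel script
I posted earlier; names and the 64 KiB prefix are just placeholders, and it
skips the weird-filename handling) of how size-as-weak-hash, hard-link
clustering, and the truncated-hash quick check fit together:

    #!/usr/bin/env python3
    # Serial sketch only: group by size, collapse hard-link clusters so each
    # inode is read/hashed once, then use a truncated hash as a cheap
    # "is a duplicate even possible?" screen.  No false negatives; confirm
    # any reported group with cmp or a full hash if it matters.
    # Note: hard links to the same inode are reported as duplicates here.
    import os, sys, hashlib
    from collections import defaultdict

    PREFIX = 1 << 16                      # bytes hashed per file (placeholder)

    def files(root):
        for dirp, _, names in os.walk(root):
            for n in names:
                p = os.path.join(dirp, n)
                if os.path.isfile(p) and not os.path.islink(p):
                    yield p

    by_size = defaultdict(list)           # file size is the weak "hash"
    for p in files(sys.argv[1] if len(sys.argv) > 1 else '.'):
        by_size[os.stat(p).st_size].append(p)

    for size, paths in by_size.items():
        if len(paths) < 2:                # unique size => no duplicate possible
            continue
        by_inode = defaultdict(list)      # hard links: one read per cluster
        for p in paths:
            st = os.stat(p)
            by_inode[(st.st_dev, st.st_ino)].append(p)
        by_hash = defaultdict(list)
        for links in by_inode.values():
            with open(links[0], 'rb') as f:
                h = hashlib.sha256(f.read(PREFIX)).hexdigest()
            by_hash[h].extend(links)
        for dups in by_hash.values():
            if len(dups) > 1:
                print('\t'.join(dups))    # TAB-delimited, one group per line

An empty report from something like that means no duplicates are possible;
a non-empty one only means "worth a cmp or a full hash".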
Some contexts are hard, like files being updated during the duplicate scan.
Yeah, you can look at time stamps and maybe re-compute things, but ick; if
things are changing, who knows what version the user wants assessed?
Duplicates are probably rare. Large trees of heavily hard-linked files are
probably a much smaller optimization than using file size as a weak hash to
organize the calculation. You can probably find a hash function faster than
your IO. Or maybe you have files on some quad M.2 NVMe RAID setup
delivering 13 gigabytes/sec, and it is harder to match that with a crypto
hash function. Almost everything you say needs a "probably/maybe"
qualifier. I don't think you disagree. I'm just elaborating a little for
passers-by.

P.S.: Two of my side-comments got auto-wrapped at 76 text columns. Should
be an easy fix for anyone cutting & pasting that Python code. Sorry!
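P.P.S.: If anyone wants to check the hash-vs-IO balance on their own
hardware, a rough, throwaway timing like the one below gives a quick answer
(plain hashlib, blake2b as just one example; numbers are very approximate,
especially once the file is already in the buffer cache):

    # Rough check of read throughput vs. hash throughput on one big file.
    import sys, time, hashlib

    path = sys.argv[1]                    # point it at a large local file
    t0 = time.perf_counter()
    data = open(path, 'rb').read()
    t1 = time.perf_counter()
    digest = hashlib.blake2b(data).hexdigest()
    t2 = time.perf_counter()
    mb = len(data) / 1e6
    print('read: %.0f MB/s  hash: %.0f MB/s  (%s...)' %
          (mb / max(t1 - t0, 1e-9), mb / max(t2 - t1, 1e-9), digest[:16]))

If the hash MB/s comes out well above the read MB/s, IO is the bottleneck
and parallel hashing buys little; the other way around, and parallelism can
help.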