From: Charles Blake <charlechaud@gmail.com>
Date: Tue, 9 Apr 2019 05:28:33 -0400
Subject: Re: find duplicate files
To: Zsh Users
List-Id: Zsh Users List

Apologies if my 70-line Python script post added to that confusion. I thought it better to include working code rather than merely a list of theoretical optimizations to consider in a pure Zsh implementation.
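(For anyone who did not see that post: the sketch below is not the script itself, just a minimal illustration of the basic shape of the approach it and most dup-finders share: walk a tree, group files by size, and only hash within a size group, since files of different sizes can never be duplicates. Names and limits here are illustrative only.)

#!/usr/bin/env python3
# Minimal sketch only -- not the 70-line script from the earlier post.
# Group files by size first (different sizes cannot be duplicates), then
# group the survivors by a content hash and report any multi-member group.
import hashlib, os, sys
from collections import defaultdict

def find_dups(root):
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass                        # vanished/unreadable: skip
    for paths in by_size.values():
        if len(paths) < 2:
            continue                        # unique size => cannot be a dup
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.sha256()
            try:
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(1 << 20), b''):
                        h.update(chunk)
            except OSError:
                continue
            by_hash[h.hexdigest()].append(path)
        for group in by_hash.values():
            if len(group) > 1:
                yield group

if __name__ == '__main__':
    for group in find_dups(sys.argv[1] if len(sys.argv) > 1 else '.'):
        print('\n'.join(group), end='\n\n')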
In my experience lately, availability of Python is greater than that of either Zsh or a C compiler. So, there is also that, I suppose. In timings I just did, my Python was only 1.25x slower than 'duff' (in MAX=-1/hash-everything mode). Most of the time is in the cryptographic hashing, for which the Python just calls fast implementations, so that's not so surprising. Volodymyr's pipeline took about 20X longer and Bart's pure Zsh took 10X longer. jdupes was 20X faster than the Python but missed 87% of my dups. [ I have not tried to track down why; conceivably wonky pathnames are making it abort early. That seems the most likely culprit. ] The unix duff seems better designed, with a -0 option, FWIW.

One other optimization (not easily exhibited without a low-level language) is to not use a cryptographic hash at all, but to use a super fast one and do the equivalent of a "cmp" only on matching hashes (or maybe a slow hash only after a fast hash). That is the jdupes approach (a rough sketch is appended below). I think going parallel adds more value than that, though. At least in server/laptop/desktop CPUs from around Intel's Nehalem onward, just one core has been unable to saturate DIMM bandwidth; I usually get 2X-8X more DIMM bandwidth going multi-core. So, even infinitely fast hashes don't mean parallelism could not speed things up by a big factor for RAM-resident or superfast-IO-backed sets of (equally sized) files. At those speeds, Python's cPickle bandwidth would likely suck enough compared to the hash/cmp-ing that you'd need a C-like implementation to max out your performance.

This may *seem* to be drifting further off topic, but it actually does speak to Ray's original question. Hardware has evolved enough over the last 50 years that a tuned implementation from yester-decade had concerns different from a tuned implementation today: RAM is giant compared to (many) file sets, IO is fast compared to RAM, and CPU cores are abundant. Of the 5 solutions discussed (pure Zsh, Volodymyr's, Python, duff, jdupes), only my Python one used parallelism to any good effect.

Then there is variation in what optimization applies to what file sets, so a good tool would probably provide all of them. Even though there may be dozens to hundreds of such tools, there may well *still* be room for some new tuned implementation in a fast language that takes all the optimizations I've mentioned into account instead of just a subset, where any single optimization might, depending upon deployment context, make the interesting procedure "several to many" times faster. [ Maybe some Rust person already did it. They seem to have a lot of energy. ;-) ] I agree the original question from Emanuel Berg was probably just asked unaware of any such tool, though, or out of curiosity.

Anyway, enough about the most optimal approaches. It was probably always too large a topic.

Cheers
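P.S. Here is the rough sketch mentioned above, in Python rather than a low-level language and with error handling glossed over. It is not how jdupes itself is implemented, just the general shape of two of the ideas: a fast non-cryptographic hash (crc32, simply because it is in the stdlib) used only as a prefilter, a full byte-by-byte compare to confirm matches, and a process pool so several files are hashed at once. It assumes the paths given on the command line have already been grouped by equal size, as in the earlier sketch.

#!/usr/bin/env python3
# Rough sketch: fast-hash prefilter + byte-compare confirmation, with the
# hashing done in parallel.  Paths are assumed to already share one file size.
import filecmp, sys, zlib
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def crc_of(path):
    crc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            crc = zlib.crc32(chunk, crc)
    return path, crc

def confirmed_dups(paths):
    by_crc = defaultdict(list)
    with ProcessPoolExecutor() as pool:      # hash several files at once
        for path, crc in pool.map(crc_of, paths):
            by_crc[crc].append(path)
    for group in by_crc.values():            # a crc match is only a hint ...
        while len(group) > 1:
            first, rest = group[0], group[1:]
            same = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if same:
                yield [first] + same         # ... the byte compare confirms it
            group = [p for p in rest if p not in same]

if __name__ == '__main__':
    for group in confirmed_dups(sys.argv[1:]):
        print('\n'.join(group), end='\n\n')

Something like xxHash would be closer to what the fast-hash tools actually use, and the filecmp.cmp(..., shallow=False) step is the "cmp" that makes any hash collision harmless.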