From: Charles Blake <charlechaud@gmail.com>
Date: Tue, 9 Apr 2019 05:28:33 -0400
Subject: Re: find duplicate files
To: Zsh Users
List-Id: Zsh Users List

Apologies if my 70-line Python script post added to that confusion. I thought it better to include working code rather than merely a list of theoretical optimizations to consider in a pure Zsh implementation.
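(For anyone who did not see that post: the sketch below is not the script itself, just a minimal illustration of the basic shape of the approach it and most dup-finders share: walk a tree, group files by size, and only hash within a size group, since files of different sizes can never be duplicates. Names and limits here are illustrative only.)

#!/usr/bin/env python3
# Minimal sketch only -- not the 70-line script from the earlier post.
# Group files by size first (different sizes cannot be duplicates), then
# group the survivors by a content hash and report any multi-member group.
import hashlib, os, sys
from collections import defaultdict

def find_dups(root):
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass                        # vanished/unreadable: skip
    for paths in by_size.values():
        if len(paths) < 2:
            continue                        # unique size => cannot be a dup
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.sha256()
            try:
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(1 << 20), b''):
                        h.update(chunk)
            except OSError:
                continue
            by_hash[h.hexdigest()].append(path)
        for group in by_hash.values():
            if len(group) > 1:
                yield group

if __name__ == '__main__':
    for group in find_dups(sys.argv[1] if len(sys.argv) > 1 else '.'):
        print('\n'.join(group), end='\n\n')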
In my experience lately, availability of Python is greater than that of either Zsh or a C compiler. So, there is also that, I suppose. In timings I just did, my Python was only 1.25x slower than 'duff' (in MAX=-1/hash-everything mode). Most of the time is in the cryptographic hashing, for which the Python just calls fast implementations, so that's not so surprising. Volodymyr's pipeline took about 20X longer and Bart's pure Zsh took 10X longer. jdupes was 20X faster than the Python but missed 87% of my dups. [ I have not tried to track down why; conceivably wonky pathnames are making it abort early. That seems the most likely culprit. ] The unix duff seems better designed, with a -0 option, FWIW.

One other optimization (not easily exhibited without a low-level language) is to not use a cryptographic hash at all, but to use a super fast one and do the equivalent of a "cmp" only on matching hashes (or maybe a slow hash only after a fast hash). That is the jdupes approach (a rough sketch is appended below). I think going parallel adds more value than that, though. At least in server/laptop/desktop CPUs from around Intel's Nehalem onward, just one core has been unable to saturate DIMM bandwidth; I usually get 2X-8X more DIMM bandwidth going multi-core. So, even infinitely fast hashes don't mean parallelism could not speed things up by a big factor for RAM-resident or superfast-IO-backed sets of (equally sized) files. At those speeds, Python's cPickle bandwidth would likely suck enough compared to the hash/cmp-ing that you'd need a C-like implementation to max out your performance.

This may *seem* to be drifting further off topic, but it actually does speak to Ray's original question. Hardware has evolved enough over the last 50 years that a tuned implementation from yester-decade had concerns different from a tuned implementation today: RAM is giant compared to (many) file sets, IO is fast compared to RAM, and CPU cores are abundant. Of the 5 solutions discussed (pure Zsh, Volodymyr's, Python, duff, jdupes), only my Python one used parallelism to any good effect.

Then there is variation in what optimization applies to what file sets, so a good tool would probably provide all of them. Even though there may be dozens to hundreds of such tools, there may well *still* be room for some new tuned implementation in a fast language that takes all the optimizations I've mentioned into account instead of just a subset, where any single optimization might, depending upon deployment context, make the interesting procedure "several to many" times faster. [ Maybe some Rust person already did it. They seem to have a lot of energy. ;-) ] I agree the original question from Emanuel Berg was probably just asked unaware of any such tool, though, or out of curiosity.

Anyway, enough about the most optimal approaches. It was probably always too large a topic.

Cheers
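P.S. Here is the rough sketch mentioned above, in Python rather than a low-level language and with error handling glossed over. It is not how jdupes itself is implemented, just the general shape of two of the ideas: a fast non-cryptographic hash (crc32, simply because it is in the stdlib) used only as a prefilter, a full byte-by-byte compare to confirm matches, and a process pool so several files are hashed at once. It assumes the paths given on the command line have already been grouped by equal size, as in the earlier sketch.

#!/usr/bin/env python3
# Rough sketch: fast-hash prefilter + byte-compare confirmation, with the
# hashing done in parallel.  Paths are assumed to already share one file size.
import filecmp, sys, zlib
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def crc_of(path):
    crc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            crc = zlib.crc32(chunk, crc)
    return path, crc

def confirmed_dups(paths):
    by_crc = defaultdict(list)
    with ProcessPoolExecutor() as pool:      # hash several files at once
        for path, crc in pool.map(crc_of, paths):
            by_crc[crc].append(path)
    for group in by_crc.values():            # a crc match is only a hint ...
        while len(group) > 1:
            first, rest = group[0], group[1:]
            same = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if same:
                yield [first] + same         # ... the byte compare confirms it
            group = [p for p in rest if p not in same]

if __name__ == '__main__':
    for group in confirmed_dups(sys.argv[1:]):
        print('\n'.join(group), end='\n\n')

Something like xxHash would be closer to what the fast-hash tools actually use, and the filecmp.cmp(..., shallow=False) step is the "cmp" that makes any hash collision harmless.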