From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <zsh-workers-return-41228-mason-zsh=primenet.com.au@zsh.org>
Received: (qmail 29360 invoked by alias); 5 Jun 2017 11:54:53 -0000
Mailing-List: contact zsh-workers-help@zsh.org; run by ezmlm
Precedence: bulk
X-No-Archive: yes
List-Id: Zsh Workers List <zsh-workers.zsh.org>
List-Post: <mailto:zsh-workers@zsh.org>
List-Help: <mailto:zsh-workers-help@zsh.org>
X-Seq: 41228
Received: (qmail 26499 invoked from network); 5 Jun 2017 11:54:53 -0000
X-Qmail-Scanner-Diagnostics: from mail-wm0-f45.google.com by f.primenet.com.au (envelope-from <stephane.chazelas@gmail.com>, uid 7791) with qmail-scanner-2.11 
 (clamdscan: 0.99.2/21882. spamassassin: 3.4.1.  
 Clear:RC:0(74.125.82.45):SA:0(-0.0/5.0):. 
 Processed in 1.666537 secs); 05 Jun 2017 11:54:53 -0000
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on f.primenet.com.au
X-Spam-Level: 
X-Spam-Status: No, score=-0.0 required=5.0 tests=FREEMAIL_FROM,
	RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_PASS,
	T_DKIM_INVALID autolearn=unavailable autolearn_force=no version=3.4.1
X-Envelope-From: stephane.chazelas@gmail.com
X-Qmail-Scanner-Mime-Attachments: |
X-Qmail-Scanner-Zip-Files: |
Received-SPF: pass (ns1.primenet.com.au: SPF record at _netblocks.google.com designates 74.125.82.45 as permitted sender)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=date:from:to:cc:subject:message-id:mail-followup-to:references
         :mime-version:content-disposition:in-reply-to:user-agent;
        bh=E1g9BKlreswAbv/94jLrrmyvOlli9z94wNBupB0l6Gs=;
        b=dTfboxDtI4Ro9BvKc+PJAipbKaMW2I8SJF7A02De/lVwEO94wnqOcCC/gR8Ss5NPKg
         gVUvqr4TumViTGjoXKmt9qgKKW/c4z7GKOXq/cFE4iW276kZwsGD+7+bbW7zDIo8zSzx
         IHSCENC33NU6zs47sBsPczNRQcbi/q//kMIFht+LWTGuCJxcrJ6mHlnkHy2u6LJzRJat
         ExiC3LMNLznBOEJpZsqqB5ibVfqLS5ByCPIBm3BwHjv4GvnOzcdoQOc+Awxex4ThO3MM
         WmWnfl1ebAaN6ybUycXgOSkVLDhcQb6C2w6s7l60TjnI0KsYyrAaDE/k6yejQo7DGCN5
         Jsnw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:cc:subject:message-id
         :mail-followup-to:references:mime-version:content-disposition
         :in-reply-to:user-agent;
        bh=E1g9BKlreswAbv/94jLrrmyvOlli9z94wNBupB0l6Gs=;
        b=txYYTVM8Gn5WD/soHqZ8HKYf9h9cJ8ksA8LStyF9Q+Gvk+jFnEjOpQSNHTOtDYJDBl
         9RR3nMkuw/NrlyrdZlYtbRWCgX014yd0RR0QBlKY30z8DaYW1UJl0CoND4nLwT8rd733
         uNdCs85gr4uboYvEtTEPNot0OjL8qNo+/Nf/JW7g/8cFpplHG2C9YMOjxtRB0Lfew0QW
         OnmuvQKkpdWiwfctJb7zJPpbaadwprFd0MOT/rcm82CPc21nMZyrS0D/CBm8hJW+0GVH
         cQJUW7ejL41Fu24VvOPwvOvBqivzwIHB74poobhLtmCNQBa5k0k5rdnLXyjKWvt9G3p4
         VPqA==
X-Gm-Message-State: AODbwcAgjcMfKawiUnUCc/40/F0NfVNCM9zAga6RvwZNa2iafJaKjTJl
	H8adk/9jlgIe509g
X-Received: by 10.28.54.154 with SMTP id y26mr7249340wmh.53.1496663682905;
        Mon, 05 Jun 2017 04:54:42 -0700 (PDT)
Date: Mon, 5 Jun 2017 12:54:39 +0100
From: Stephane Chazelas <stephane.chazelas@gmail.com>
To: Bart Schaefer <schaefer@brasslantern.com>
Cc: Zsh hackers list <zsh-workers@zsh.org>
Subject: Re: Surprising behaviour with numeric glob sort
Message-ID: <20170605115439.GA15325@chaz.gmail.com>
Mail-Followup-To: Bart Schaefer <schaefer@brasslantern.com>,
	Zsh hackers list <zsh-workers@zsh.org>
References: <20170531212453.GA31563@chaz.gmail.com>
 <170601152943.ZM4783@torch.brasslantern.com>
 <20170602090332.GA6574@chaz.gmail.com>
 <170602161905.ZM10488@torch.brasslantern.com>
 <20170603211645.GA17785@chaz.gmail.com>
 <170603170724.ZM15645@torch.brasslantern.com>
 <20170604173157.GB9094@chaz.gmail.com>
 <170604150135.ZM13291@torch.brasslantern.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <170604150135.ZM13291@torch.brasslantern.com>
User-Agent: Mutt/1.5.24 (2015-08-30)

2017-06-04 15:01:35 -0700, Bart Schaefer:
> On Jun 4,  6:31pm, Stephane Chazelas wrote:
> }
> } "Slow" (though probably quite negligible compared to strcoll()
> } which already does things like in addition to a lot more hard
> } wark) but working.
> 
> According to one strcoll man page I have, the right thing to do is
> convert all the strings with strxfrm and then strcmp those.
> 
> It provides no advice on how to order the original array to match the
> sorted result of the xfrm'd array (the transform is not reversible),
> nor how to determine how much space to allocate for each transform.
> 
> Zsh has the additional complication of needing to deal with strings
> having embedded '\0' bytes, which neither strcoll nor strxfrm is able
> to deal with.  I'm not 100% confident that zsh's current algorithm
> deals correct with this either.

>>From what I can see (by using ltrace -e strcoll), zsh removes
the NUL bytes before calling strcoll, so $'\0\0x' sorts the same
as $'\0x' or 'x'.

That's as if $'\0' had IGNORE for all weights in the collation
order which I suppose is as valid as any. Per POSIX, such input
would not be text, so "sorting" them may not be very meaningful.
Same for strings that contain byte sequences that don't form
valid characters where there's no saying what strcoll() may do
with them.

Having NUL sort before any other character would be preferable
in the C locale that sorts by code point though (like to sort
UTF-16 text based on codepoint).

> A possible approach would be to pre-allocate a hash table and fill
> it on the fly with keys the original strings and values the strxfrm
> results.  An additional strxfrm could be called on the trailing bits
> after any embedded nul.  Then sort the original array by comparing
> the values in the hash.  Doesn't solve the question of how much space
> to reserve, but it would allow the current algorithm for picking out
> numeric substrings to be used.

You mean a Schwartzian transform
(https://en.wikipedia.org/wiki/Schwartzian_transform) that sorts
on (here using $'\0zsh\0-1.2\0x' as an example):

{
  original <= $'\0zsh\0-1.2\0x'
  parts <= [{string: $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-")},
            {num: 1},
	    {string: strxfrm(".")},
	    {num: 2},
	    {string: $'\0' . strxfrm("x")}]
}, ...

With a comparison function that does memcmp() on the "string"
parts and a number comparison on the "num" parts?

Yes, that sounds like a good idea that trades memory for
efficiency (by doing the strxfrm() hard work up front and only
once), but if we're going to transform upfront anyway, the
memory is traded off already.

Or same using zero-padding

{
  original <= $'\0zsh\0-1.2\0x'
  transformed <= $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-") .
                 "001" .  strxfrm(".") . "002" . $'\0' . strxfrm("x")}]
}, ...

(and using memcmp() in the comparison function)

With same benefits (a-1 a+2 a-3 sorted in that order where "-"
and "+" are ignored in the first pass like without
numericglobsort) and same drawback (need to find out the right
padding width).

> For parameter array sorting we could extend the (m) flag to indicate
> that this transformation is required [it already means to count byte
> widths for (l), (r), etc.] so as to avoid the overhead when it is not
> needed.  For globbing, we'd have to rely on something else such as
> whether MULTIBYTE is set.

Note that for globbing, the "numeric" sort applies after the
"o+myfunc" or "oe:...:" transformation, so the strings to sort
on may still contain all sorts of things like NUL bytes and byte
sequences that don't form valid characters even if they were not
in the file names in the first place.

Like in a

by_content() REPLY=$mapfile[$REPLY]
echo *(no+by_content)

-- 
Stephane