From: Stephane Chazelas <stephane.chazelas@gmail.com>
To: Bart Schaefer <schaefer@brasslantern.com>
Cc: Zsh hackers list <zsh-workers@zsh.org>
Subject: Re: Surprising behaviour with numeric glob sort
Date: Fri, 2 Jun 2017 10:03:32 +0100
Message-ID: <20170602090332.GA6574@chaz.gmail.com>
In-Reply-To: <170601152943.ZM4783@torch.brasslantern.com>

2017-06-01 15:29:43 -0700, Bart Schaefer:
> On May 31, 10:24pm, Stephane Chazelas wrote:
> }
> } Maybe a better approach would be to break down the strings
> } between non-numeric and numeric parts and use strcoll() on the
> } non-numeric and number comparison on the numeric parts, stopping
> } at the first difference.
> 
> I don't think that helps, in the general case.  It would still mean
> the sort is not stable where the numeric parts are the same but the
> non-numeric part is partially-ordered.
> 
> To stabilize the sort we'd have to, for example, replace strcoll()
> with something that falls back to byte value ordering whenever the
> collation order of two characters is equivalent, but that requires
> lookahead (doesn't work on prefixes).
[...]

Sorry, my choice of words was poor. I shouldn't have used
"total" there.

OK, in a locale where A, B and C sort the same, the glob order
is non-deterministic (with or without numericglobsort, with the
current situation or with the change I propose), but sorting is
still possible.

But with the comparison algorithm of zsh's current *(n), which
for some values of A, B, C gives A < B, B < C and C < A, sorting
is just not possible. Some qsort() implementations will give one
result (which cannot satisfy all of those relations), some have
been known to SEGV, some might loop indefinitely. But more
importantly, it gives unexpected results in real-life cases.
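
To make the A < B, B < C, C < A problem concrete, here is a
minimal sketch of a brute-force check for such cycles. It is not
zsh's code: find_cycle(), cyclic_cmp() and plain_cmp() are
hypothetical names, and the cyclic comparator is a deliberately
contrived rock-paper-scissors ordering rather than the *(n)
algorithm.

```python
from itertools import permutations

def find_cycle(items, cmp):
    """Return a triple (a, b, c) with a < b, b < c and c < a, or None."""
    for a, b, c in permutations(items, 3):
        if cmp(a, b) < 0 and cmp(b, c) < 0 and cmp(c, a) < 0:
            return (a, b, c)
    return None

# Hypothetical comparator with a deliberate cycle:
# rock < scissors, scissors < paper, paper < rock.
BEFORE = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}

def cyclic_cmp(x, y):
    if x == y:
        return 0
    return -1 if (x, y) in BEFORE else 1

def plain_cmp(x, y):
    # Ordinary code-point comparison, which is a total order.
    return (x > y) - (x < y)

names = ["rock", "paper", "scissors"]
cycle = find_cycle(names, cyclic_cmp)    # a triple: no valid sort exists
no_cycle = find_cycle(names, plain_cmp)  # None: sorting is well defined
```

Any comparator for which find_cycle() returns a triple cannot be
handed to a sort routine with a well-defined result.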

In a locale where A, B and C sort the same, with
numericglobsort, A2 B10 C1 should sort as C1 A2 B10, just like
without numericglobsort it should (and does) sort as C1 B10 A2¹,
or like print -l A B C | sort -u would give one line (A here
because of the last-resort memcmp() comparison). I have no
problem with that. That's the intention of the collation
algorithm (though I'd argue those locales are broken: locale
collation algorithms, at least the system ones, should have a
total order; that was more or less the conclusion of a related
discussion at the Open Group). But:

$ echo *(n)
zsh-10 zsh2 zsh10 zsh-3

(here in my en_GB.UTF-8 GNU locale)

is unexpected/broken. "zsh" sorts before "zsh-" in my locale, so
I'd expect the zsh2, zsh10 to come before zsh-3, zsh-10 which is
the basis of my proposal. In any case, zsh-3 should come before
zsh-10, nobody can argue against that.
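
The proposal quoted at the top (split into non-numeric and
numeric parts, strcoll() on the former, number comparison on the
latter, stop at the first difference) could be sketched roughly
as follows. This is only an illustration in Python, not a patch:
chunks() and chunk_cmp() are hypothetical names, only ASCII
digits are treated as numeric, and what happens when one name
runs out of parts first is my own assumption.

```python
import locale
import re
from functools import cmp_to_key

DIGITS = "0123456789"

def chunks(name):
    """Split into alternating runs, e.g. 'zsh-10' -> ['zsh-', '10']."""
    return re.findall(r"[0-9]+|[^0-9]+", name)

def chunk_cmp(a, b):
    ca, cb = chunks(a), chunks(b)
    for x, y in zip(ca, cb):
        if x[0] in DIGITS and y[0] in DIGITS:
            # Both parts numeric: compare as numbers.
            r = (int(x) > int(y)) - (int(x) < int(y))
        else:
            # Otherwise use the locale's collation order.
            r = locale.strcoll(x, y)
        if r:
            return r  # stop at the first difference
    # Assumption: the name with fewer parts sorts first.
    return len(ca) - len(cb)

names = ["zsh-10", "zsh2", "zsh10", "zsh-3"]
ordered = sorted(names, key=cmp_to_key(chunk_cmp))
```

In a locale where "zsh" collates before "zsh-" (which is also the
case in the C locale that Python uses when setlocale() has not
been called), ordered is zsh2 zsh10 zsh-3 zsh-10.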

In a locale where "zsh-" sorts the same as "zsh", *(n) currently
gives either zsh2 zsh-3 zsh10 zsh-10 or zsh2 zsh-3 zsh-10 zsh10,
both of which are fine with me. And it wouldn't change with my
proposal. It would be nice to have a consistent order, for
instance by implementing a last-resort memcmp()-based comparison
like "sort" does without -s, but that's nowhere near as
important a problem since, in my experience, real-life file
names don't have parts that sort the same in any locale (and
only GNU systems, in my experience, have locales with such
non-total orders, for the most part unintentionally, like the
① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ that sort the same in GNU locales).
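
A last-resort comparison of that kind might look like the sketch
below. The names are hypothetical, and tied_cmp() merely stands
in for a collation order with ties (it compares case-folded
strings) since the locales discussed here cannot be assumed to be
installed; the byte comparison plays the role of memcmp().

```python
from functools import cmp_to_key

def tied_cmp(a, b):
    """Stand-in for a collation with ties: 'A' and 'a' compare equal."""
    x, y = a.casefold(), b.casefold()
    return (x > y) - (x < y)

def with_byte_fallback(cmp):
    """Wrap cmp so that ties are broken by comparing the encoded bytes."""
    def wrapped(a, b):
        r = cmp(a, b)
        if r:
            return r
        ab, bb = a.encode(), b.encode()
        return (ab > bb) - (ab < bb)
    return wrapped

stable_key = cmp_to_key(with_byte_fallback(tied_cmp))

# The result no longer depends on the input order.
first = sorted(["b", "A", "a", "B"], key=stable_key)
second = sorted(["B", "a", "A", "b"], key=stable_key)
```

With the fallback in place, both calls give A a B b, whatever
order the names come in.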

¹ Note that the fact that x-A xB x-C sorts in that order in GNU
non-C locales is not because "x" sorts the same as "x-" but
because the primary weight of a "-" is "IGNORE" so when
comparing x-A and xB, strcoll() first compares "xA" with "xB".
If it were xA against x-A, then the other weights would be
considered, which would sort "xA" before "x-A".


-- 
Stephane

