zsh-workers
 help / color / mirror / code / Atom feed
From: Stephane Chazelas <stephane.chazelas@gmail.com>
To: Bart Schaefer <schaefer@brasslantern.com>,
	Zsh hackers list <zsh-workers@zsh.org>
Subject: Re: Surprising behaviour with numeric glob sort
Date: Wed, 7 Jun 2017 09:41:47 +0100	[thread overview]
Message-ID: <20170607084147.GA5395@chaz.gmail.com> (raw)
In-Reply-To: <20170605115439.GA15325@chaz.gmail.com>

2017-06-05 12:54:39 +0100, Stephane Chazelas:
[...]
> You mean a Schwartzian transform
> (https://en.wikipedia.org/wiki/Schwartzian_transform) that sorts
> on (here using $'\0zsh\0-1.2\0x' as an example):
> 
> {
>   original <= $'\0zsh\0-1.2\0x'
>   parts <= [{string: $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-")},
>             {num: 1},
> 	    {string: strxfrm(".")},
> 	    {num: 2},
> 	    {string: $'\0' . strxfrm("x")}]
> }, ...
> 
> With a comparison function that does memcmp() on the "string"
> parts and a number comparison on the "num" parts?
> 
> Yes, that sounds like a good idea that trades memory for
> efficiency (by doing the strxfrm() hard work up front and only
> once), but if we're going to transform upfront anyway, the
> memory is traded off already.
> 
> Or same using zero-padding
> 
> {
>   original <= $'\0zsh\0-1.2\0x'
>   transformed <= $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-") .
>                  "001" .  strxfrm(".") . "002" . $'\0' . strxfrm("x")}]
> }, ...
[...]

That would not be valid I (now) believe.

The transformation done by strxfrm() is opaque. For all we now,
it could very well return:

  strxfrm("foo")    -> "455"
  strxfrm("bar")    -> "217"
  strxfrm("foobar") -> "45517"

So comparing "foo1bar" "foo6bar" "foobar" with our function
(with adapted width padding) would be memcmp()ing "4551217",
"4556217" and "45517" resulting in potentially surprising
orders.

Even only concatenating the result of several strxfrm() is
probably not valid an approach.

It's probably OK if they're interspersed with NUL bytes as that
byte can't be found in the result of strxfrm() (and sorts before
anything else for memcmp()), but even then, when doing "human
collation" sorting, it probably makes more sense to consider
them as padding (like space) than the thing that must sort
before anything else, so removing them (or replacing them with
spaces to have a deterministic order) instead of inserting them
as \0 in between several strxfrm()s is just as valid and maybe
even preferable.

I think that if I had to implement it, I'd take the lazy
approach of:
 - in the C locale, use memcmp() with sequences of digits
   replaced with their 20-width 0-padding.
 - in other locales, strip the NULs and use strcoll() with
   sequences of digits replaced with their 20-width 0-padding
   (or memcmp() of the stxfrm() of the above, but with that 20x
   factor for the padding combined with a typical 5x factor for
   strxfrm(), the memory usage penalty may be too high).

Or the first approach mentioned above even if Vincent wouldn't
like it.

-- 
Stephane


  parent reply	other threads:[~2017-06-07  8:41 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-31 21:24 Stephane Chazelas
2017-06-01 22:29 ` Bart Schaefer
2017-06-02  9:03   ` Stephane Chazelas
2017-06-02 23:19     ` Bart Schaefer
2017-06-03 21:16       ` Stephane Chazelas
2017-06-04  0:07         ` Bart Schaefer
2017-06-04 17:31           ` Stephane Chazelas
2017-06-04 22:01             ` Bart Schaefer
2017-06-05 11:54               ` Stephane Chazelas
2017-06-05 19:15                 ` Stephane Chazelas
2017-06-06  3:13                 ` Bart Schaefer
2017-06-06  9:22                   ` Stephane Chazelas
2017-06-07  8:41                 ` Stephane Chazelas [this message]
2017-06-17 18:11                   ` Bart Schaefer
2017-06-06 14:44         ` Vincent Lefevre
2017-06-06 16:47           ` Stephane Chazelas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170607084147.GA5395@chaz.gmail.com \
    --to=stephane.chazelas@gmail.com \
    --cc=schaefer@brasslantern.com \
    --cc=zsh-workers@zsh.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).