Surprising behaviour with numeric glob sort

zsh-workers
 help / color / mirror / code / Atom feed

* Surprising behaviour with numeric glob sort
@ 2017-05-31 21:24 Stephane Chazelas
  2017-06-01 22:29 ` Bart Schaefer
  0 siblings, 1 reply; 16+ messages in thread
From: Stephane Chazelas @ 2017-05-31 21:24 UTC (permalink / raw)
  To: Zsh hackers list

Some odd behaviour:

$ echo *(n)
a3 a10 a-2
$ rm a10
$ echo *(n)
a-2 a3

(order of a-2 and a3 reversed after a10 is removed)

$ echo $LANG
en_GB.UTF-8
$ /lib/x86_64-linux-gnu/libc.so.6
GNU C Library (Ubuntu GLIBC 2.23-0ubuntu7) stable release version 2.23, by Roland McGrath et al.

I think the problem is that here, zsh uses a non-total order. We have:

a3 < a10 (same prefix, so numeric comparison of 3 < 10)
a10 < a-2 (as per strcoll(). - ignored in first pass. No same prefix, so no numeric comparison)
a-2 < a3 (as per strcoll(). - ignored in first pass. No same prefix, so no numeric comparison)

We've got a circle here. AFAIK, the behaviour is unspecified if
qsort() is called with a comparison function that doesn't
implement a total order, so the result could be random.

Once we remove a10, we're OK.

GNU sort -V and ls -v seem to "handle" the issue by giving up on
strcoll() which is not a lot better:

$ ls
a0  á0  a10  a-2  a3  b0
$ ls -v
a0  a3  a10  a-2  b0  á0
$ ls | sort
a0
á0
a10
a-2
a3
b0
$ ls | sort -V
a0
a3
a10
a-2
b0
á0

Maybe a better approach would be to break down the strings
between non-numeric and numeric parts and use strcoll() on the
non-numeric and number comparison on the numeric parts, stopping
at the first difference.

a3 < a-2 because a < a- as per strcoll (even though a3 > 1-2)

-- 
Stephane

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-05-31 21:24 Surprising behaviour with numeric glob sort Stephane Chazelas
@ 2017-06-01 22:29 ` Bart Schaefer
  2017-06-02  9:03   ` Stephane Chazelas
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Schaefer @ 2017-06-01 22:29 UTC (permalink / raw)
  To: Zsh hackers list

On May 31, 10:24pm, Stephane Chazelas wrote:
}
} Maybe a better approach would be to break down the strings
} between non-numeric and numeric parts and use strcoll() on the
} non-numeric and number comparison on the numeric parts, stopping
} at the first difference.

I don't think that helps, in the general case.  It would still mean
the sort is not stable where the numeric parts are the same but the
non-numeric part is partially-ordered.

To stabilize the sort we'd have to, for example, replace strcoll()
with something that falls back to byte value ordering whenever the
collation order of two characters is equivalent, but that requires
lookahead (doesn't work on prefixes).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-01 22:29 ` Bart Schaefer
@ 2017-06-02  9:03   ` Stephane Chazelas
  2017-06-02 23:19     ` Bart Schaefer
  0 siblings, 1 reply; 16+ messages in thread
From: Stephane Chazelas @ 2017-06-02  9:03 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2017-06-01 15:29:43 -0700, Bart Schaefer:
> On May 31, 10:24pm, Stephane Chazelas wrote:
> }
> } Maybe a better approach would be to break down the strings
> } between non-numeric and numeric parts and use strcoll() on the
> } non-numeric and number comparison on the numeric parts, stopping
> } at the first difference.
> 
> I don't think that helps, in the general case.  It would still mean
> the sort is not stable where the numeric parts are the same but the
> non-numeric part is partially-ordered.
> 
> To stabilize the sort we'd have to, for example, replace strcoll()
> with something that falls back to byte value ordering whenever the
> collation order of two characters is equivalent, but that requires
> lookahead (doesn't work on prefixes).
[...]

Sorry, my choice of words was poor. I shouldn't have used
"total" there.

OK, in a locale where A, B and C sort the same, globbing is
non-deterministic (with or without numericglobsort, with the
current situation or with the change I propose) but is possible.

But with the comparison algorithm of zsh's current *(n) that for
some values of A,B,C, have A < B, B < C, C < A, sorting is just
not possible. Some qsort() will give one result (that doesn't
satisfy all those), some have been known to SEGV, some might
loop indefinitely. But more importantly, it gives unexpected
results in real-life cases.

In a locale where A, B and C sort the same, with
numericglobsort, A2 B10 C1 should sort as C1 A2 B10, just like
without numericglobsort it should (and does) sort as C1 B10 A2¹
or print -l A B C | sort -u would give one line (A here because
of the last-resort memcmp() comparison). I have no problem with
that. That's the intention of the collation algorithm (though I
argue those locales are broken, locale collation algorithms, at
least the system ones should have a total order, that was more
or less the conclusion of a related discussion at the
opengroup). But:

$ echo *(n)
zsh-10 zsh2 zsh10 zsh-3

(here in my en_GB.UTF-8 GNU locale)

is unexpected/broken. "zsh" sorts before "zsh-" in my locale, so
I'd expect the zsh2, zsh10 to come before zsh-3, zsh-10 which is
the basis of my proposal. In any case, zsh-3 should come before
zsh-10, nobody can argue against that.

In a locale where "zsh-" sorts the same as "zsh", *(n) currently
gives either zsh2 zsh-3 zsh10 zsh-10 or zsh2 zsh-3 zsh10 zsh-10,
both of which are fine with me. And it wouldn't change with my
proposal. It would be nice to have a consistent order, for
instance by implementing a last-resort memcmp()-based comparison
like "sort" does without -s, but that's nowhere as important a
problem as in my experience, real life file names don't have
parts that sort the same in any locale (and only GNU systems in
my experience have locales with such non-total orders, for the
most part non-intentionally like the ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ that
sort the same in GNU locales).

¹ Note that the fact that x-A xB x-C sorts in that order in GNU
non-C locales is not because "x" sorts the same as "x-" but
because the primary weight of a "-" is "IGNORE" so when
comparing x-A and xB, strcoll() first compares "xA" with "xB".
If it was xA against x-A, then the other weights would be
considered which would sort "xA" before "x-A"

-- 
Stephane

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-02  9:03   ` Stephane Chazelas
@ 2017-06-02 23:19     ` Bart Schaefer
  2017-06-03 21:16       ` Stephane Chazelas
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Schaefer @ 2017-06-02 23:19 UTC (permalink / raw)
  To: Zsh hackers list

On Jun 2, 10:03am, Stephane Chazelas wrote:
}
} $ echo *(n)
} zsh-10 zsh2 zsh10 zsh-3
} 
} (here in my en_GB.UTF-8 GNU locale)
} 
} is unexpected/broken. "zsh" sorts before "zsh-" in my locale, so
} I'd expect the zsh2, zsh10 to come before zsh-3, zsh-10 which is
} the basis of my proposal. In any case, zsh-3 should come before
} zsh-10, nobody can argue against that.

Well, one could argue that "-10" should be treated as negative ten
and therefore should sort before negative three, but I'm not sure
we want to get into that.

The reason you get the result above is of course because most sort
algorithms assume there is a total order and therefore that it is
not necessary to compare every possible pairing of elements.

Your proposal was
> } break down the strings
> } between non-numeric and numeric parts and use strcoll() on the
> } non-numeric and number comparison on the numeric parts

As far as I can tell that's exactly zstrbcmp in zle_tricky.c does.
zstrcmp in sort.c on the other hand first attempts strcoll and
only compares numeric parts if it can find corresponding numeric
substrings in both input strings.  That is, "zsh-3" is never
compared numerically to "zsh2" because "zsh2" and "zsh-" are
considered already to differ.

In either case, if zstrcmp or zstrbcomp find a digit, they consume
more digits until they hit a non-digit or two not-equal digits, and
then look both backward and forward for digits to calculate the
numeric value for comparison.

So I think what you propose is that when "zsh1" is found to have a
difference with "zsh-", the algorithm should look forward across
"zsh-" to find "3" and at that point end up comparing "10" to "3"?
That would lead to the order in your example becoming
    zsh2 zsh-3 zsh10 zsh-10.

However, that would also mean that in strings with different sets
of numeric substrings the numeric comparisons might be be "detected"
after different prefixes for different pairs of strings; I think
the result there might be even more confusing, but I can't come up
with a specific example.

It also means having to copy non-numeric substrings during every
comparison, so as to be able to call strcoll without modifying the
input strings.  (What's the alternative?)  This would probably make
sorting prohibitively slow.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-02 23:19     ` Bart Schaefer
@ 2017-06-03 21:16       ` Stephane Chazelas
  2017-06-04  0:07         ` Bart Schaefer
  2017-06-06 14:44         ` Vincent Lefevre
  0 siblings, 2 replies; 16+ messages in thread
From: Stephane Chazelas @ 2017-06-03 21:16 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2017-06-02 16:19:05 -0700, Bart Schaefer:
> On Jun 2, 10:03am, Stephane Chazelas wrote:
> }
> } $ echo *(n)
> } zsh-10 zsh2 zsh10 zsh-3
> } 
> } (here in my en_GB.UTF-8 GNU locale)
> } 
> } is unexpected/broken. "zsh" sorts before "zsh-" in my locale, so
> } I'd expect the zsh2, zsh10 to come before zsh-3, zsh-10 which is
> } the basis of my proposal. In any case, zsh-3 should come before
> } zsh-10, nobody can argue against that.
> 
> Well, one could argue that "-10" should be treated as negative ten
> and therefore should sort before negative three, but I'm not sure
> we want to get into that.

The (my at least) main usage for *(n) is to sort version numbers
like zsh-3.0, zsh-3.1, zsh-4. So handling negative numbers
wouldn't help in those cases.

[...]
> That is, "zsh-3" is never
> compared numerically to "zsh2" because "zsh2" and "zsh-" are
> considered already to differ.
[...]
> So I think what you propose is that when "zsh1" is found to have a
> difference with "zsh-", the algorithm should look forward across
> "zsh-" to find "3" and at that point end up comparing "10" to "3"?
> That would lead to the order in your example becoming
>     zsh2 zsh-3 zsh10 zsh-10.
[...]

No, what I propose is very simple.

When comparing "zsh-3" with "zsh2", we compare the non-numeric
prefix: "zsh-" and "zsh". And already, at that point, "zsh" is
less than "zsh-", so we stop here (zsh2 < zsh-3)

If it was

zsh-3.1 vs zsh-3

["zsh-", 3, ".", 1] vs ["zsh-", 3]

- strcoll(zsh-,  zsh-) => 0
- 3 == 3
- strcoll(".", "") => zsh-3 < zsh-3.1

Now there are some aspects of the current implementation that
one might find useful like:

$ echo *
a a-3.1 a-3+1 a-3.2 a-3+2
$ (LC_ALL=C; echo *)
a a-3+1 a-3+2 a-3.1 a-3.2
$ echo *(n)
a a-3.1 a-3+1 a-3.2 a-3+2
$ (LC_ALL=C; echo *(n))
a a-3+1 a-3+2 a-3.1 a-3.2

The fact that those "-" and "." are ignored in the first
strcoll() pass in some locales makes it for a more "numerical"
sort. Though again, it's easily broken with:

$ touch a-3.10
$ echo *(n)
a a-3.1 a-3+1 a-3.2 a-3.10 a-3+2

Ideally, we'd want to hook into the strcoll() algorithm to
introduce the numerical comparisons in there. Maybe that can be
done using zero-padding like for the above, just do a strcoll()
comparison after transformation (a sort of pre-strxfrm()) of the
strings from:

a a-3.1 a-3+1 a-3.2 a-3.10 a-3+2

to:

a
a-03.01
a-03.01
a-03+01
a-03.02
a-03.10
a-03+02

adjusting the length of the padding as needed.

The above would sort to

a
a-03.01
a-03.01
a-03+01
a-03.02
a-03+02
a-03.10

In my GNU British locale and

a
a-03+01
a-03+02
a-03.01
a-03.01
a-03.02
a-03.10

In the C locale.

-- 
Stephane

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-03 21:16       ` Stephane Chazelas
@ 2017-06-04  0:07         ` Bart Schaefer
  2017-06-04 17:31           ` Stephane Chazelas
  2017-06-06 14:44         ` Vincent Lefevre
  1 sibling, 1 reply; 16+ messages in thread
From: Bart Schaefer @ 2017-06-04  0:07 UTC (permalink / raw)
  To: Zsh hackers list

On Jun 3, 10:16pm, Stephane Chazelas wrote:
}
} When comparing "zsh-3" with "zsh2", we compare the non-numeric
} prefix: "zsh-" and "zsh". And already, at that point, "zsh" is
} less than "zsh-", so we stop here (zsh2 < zsh-3)

Explain how to do that without either re-implementing strcoll(),
copying the input strings into temporaries, or requiring the input
strings to be writable so that we can plug in temporary '\0' bytes
after each non-numeric substring, and at that point we can discuss
using this algorithm.  Otherwise it's going to be unusuably slow
for any moderately large input set.

Otherwise you're describing what's already done:  strcoll() is used,
and then the non-numeric prefixes of the two strings are compared,
and only if the non-numeric parts are identical is numeric comparison
applied.  (Of course this is already slightly wrong, because it means
the non-numeric parts have to be *identical* not merely collated the
same, but to fix that we're back to "re-implement strcoll()".)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-04  0:07         ` Bart Schaefer
@ 2017-06-04 17:31           ` Stephane Chazelas
  2017-06-04 22:01             ` Bart Schaefer
  0 siblings, 1 reply; 16+ messages in thread
From: Stephane Chazelas @ 2017-06-04 17:31 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2017-06-03 17:07:24 -0700, Bart Schaefer:
> On Jun 3, 10:16pm, Stephane Chazelas wrote:
> }
> } When comparing "zsh-3" with "zsh2", we compare the non-numeric
> } prefix: "zsh-" and "zsh". And already, at that point, "zsh" is
> } less than "zsh-", so we stop here (zsh2 < zsh-3)
> 
> Explain how to do that without either re-implementing strcoll(),
> copying the input strings into temporaries, or requiring the input
> strings to be writable so that we can plug in temporary '\0' bytes
> after each non-numeric substring, and at that point we can discuss
> using this algorithm.  Otherwise it's going to be unusuably slow
> for any moderately large input set.

"Slow" (though probably quite negligible compared to strcoll()
which already does things like in addition to a lot more hard
wark) but working.

Copying into temporaries (for instance allocated on stack as
twice the size of the input was what I had in mind), either in
the comparing function, or prepare the list before by linking
each string to its prepared form (though some of that
"preparation" may not be needed in the end).

I quite like the zero-padding approach myself, though if we want
to allow numbers of any width, we'd need to do two scans of the
list, once to find the widest number and a second time to pad
(and memory allocation is more complicated).

> Otherwise you're describing what's already done:  strcoll() is used,
> and then the non-numeric prefixes of the two strings are compared,
> and only if the non-numeric parts are identical is numeric comparison
> applied.  (Of course this is already slightly wrong, because it means
> the non-numeric parts have to be *identical* not merely collated the
> same, but to fix that we're back to "re-implement strcoll()".)

I don't see how that's more "re-implement strcoll" than what zsh
already does.

What do you propose?

$ print -rl -- *(n)
zsh+10
zsh2
zsh10
zsh+3

is broken (and not slightly IMO) and needs fixed.

We can already fix that numerical sort by hand with

$ n() REPLY=${REPLY//(#m)<->/${(l:20::0:)MATCH}}
$ print -rl -- *(o+n)
zsh2
zsh+3
zsh10
zsh+10
$ (LC_ALL=C; print -rl -- *(o+n))
zsh+3
zsh+10
zsh2
zsh10

But that's obviously less "efficient" than if zsh did it
internally.

-- 
Stephane

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-04 17:31           ` Stephane Chazelas
@ 2017-06-04 22:01             ` Bart Schaefer
  2017-06-05 11:54               ` Stephane Chazelas
  0 siblings, 1 reply; 16+ messages in thread
From: Bart Schaefer @ 2017-06-04 22:01 UTC (permalink / raw)
  To: Zsh hackers list

On Jun 4,  6:31pm, Stephane Chazelas wrote:
}
} "Slow" (though probably quite negligible compared to strcoll()
} which already does things like in addition to a lot more hard
} wark) but working.

According to one strcoll man page I have, the right thing to do is
convert all the strings with strxfrm and then strcmp those.

It provides no advice on how to order the original array to match the
sorted result of the xfrm'd array (the transform is not reversible),
nor how to determine how much space to allocate for each transform.

Zsh has the additional complication of needing to deal with strings
having embedded '\0' bytes, which neither strcoll nor strxfrm is able
to deal with.  I'm not 100% confident that zsh's current algorithm
deals correct with this either.

A possible approach would be to pre-allocate a hash table and fill
it on the fly with keys the original strings and values the strxfrm
results.  An additional strxfrm could be called on the trailing bits
after any embedded nul.  Then sort the original array by comparing
the values in the hash.  Doesn't solve the question of how much space
to reserve, but it would allow the current algorithm for picking out
numeric substrings to be used.

For parameter array sorting we could extend the (m) flag to indicate
that this transformation is required [it already means to count byte
widths for (l), (r), etc.] so as to avoid the overhead when it is not
needed.  For globbing, we'd have to rely on something else such as
whether MULTIBYTE is set.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-04 22:01             ` Bart Schaefer
@ 2017-06-05 11:54               ` Stephane Chazelas
  2017-06-05 19:15                 ` Stephane Chazelas
                                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Stephane Chazelas @ 2017-06-05 11:54 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2017-06-04 15:01:35 -0700, Bart Schaefer:
> On Jun 4,  6:31pm, Stephane Chazelas wrote:
> }
> } "Slow" (though probably quite negligible compared to strcoll()
> } which already does things like in addition to a lot more hard
> } wark) but working.
> 
> According to one strcoll man page I have, the right thing to do is
> convert all the strings with strxfrm and then strcmp those.
> 
> It provides no advice on how to order the original array to match the
> sorted result of the xfrm'd array (the transform is not reversible),
> nor how to determine how much space to allocate for each transform.
> 
> Zsh has the additional complication of needing to deal with strings
> having embedded '\0' bytes, which neither strcoll nor strxfrm is able
> to deal with.  I'm not 100% confident that zsh's current algorithm
> deals correct with this either.

>From what I can see (by using ltrace -e strcoll), zsh removes
the NUL bytes before calling strcoll, so $'\0\0x' sorts the same
as $'\0x' or 'x'.

That's as if $'\0' had IGNORE for all weights in the collation
order which I suppose is as valid as any. Per POSIX, such input
would not be text, so "sorting" them may not be very meaningful.
Same for strings that contain byte sequences that don't form
valid characters where there's no saying what strcoll() may do
with them.

Having NUL sort before any other character would be preferable
in the C locale that sorts by code point though (like to sort
UTF-16 text based on codepoint).

> A possible approach would be to pre-allocate a hash table and fill
> it on the fly with keys the original strings and values the strxfrm
> results.  An additional strxfrm could be called on the trailing bits
> after any embedded nul.  Then sort the original array by comparing
> the values in the hash.  Doesn't solve the question of how much space
> to reserve, but it would allow the current algorithm for picking out
> numeric substrings to be used.

You mean a Schwartzian transform
(https://en.wikipedia.org/wiki/Schwartzian_transform) that sorts
on (here using $'\0zsh\0-1.2\0x' as an example):

{
  original <= $'\0zsh\0-1.2\0x'
  parts <= [{string: $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-")},
            {num: 1},
	    {string: strxfrm(".")},
	    {num: 2},
	    {string: $'\0' . strxfrm("x")}]
}, ...

With a comparison function that does memcmp() on the "string"
parts and a number comparison on the "num" parts?

Yes, that sounds like a good idea that trades memory for
efficiency (by doing the strxfrm() hard work up front and only
once), but if we're going to transform upfront anyway, the
memory is traded off already.

Or same using zero-padding

{
  original <= $'\0zsh\0-1.2\0x'
  transformed <= $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-") .
                 "001" .  strxfrm(".") . "002" . $'\0' . strxfrm("x")}]
}, ...

(and using memcmp() in the comparison function)

With same benefits (a-1 a+2 a-3 sorted in that order where "-"
and "+" are ignored in the first pass like without
numericglobsort) and same drawback (need to find out the right
padding width).

> For parameter array sorting we could extend the (m) flag to indicate
> that this transformation is required [it already means to count byte
> widths for (l), (r), etc.] so as to avoid the overhead when it is not
> needed.  For globbing, we'd have to rely on something else such as
> whether MULTIBYTE is set.

Note that for globbing, the "numeric" sort applies after the
"o+myfunc" or "oe:...:" transformation, so the strings to sort
on may still contain all sorts of things like NUL bytes and byte
sequences that don't form valid characters even if they were not
in the file names in the first place.

Like in a

by_content() REPLY=$mapfile[$REPLY]
echo *(no+by_content)

-- 
Stephane

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-05 11:54               ` Stephane Chazelas
@ 2017-06-05 19:15                 ` Stephane Chazelas
  2017-06-06  3:13                 ` Bart Schaefer
  2017-06-07  8:41                 ` Stephane Chazelas
  2 siblings, 0 replies; 16+ messages in thread
From: Stephane Chazelas @ 2017-06-05 19:15 UTC (permalink / raw)
  To: Bart Schaefer, Zsh hackers list

2017-06-05 12:54:39 +0100, Stephane Chazelas:
[...]
> Having NUL sort before any other character would be preferable
> in the C locale that sorts by code point though (like to sort
> UTF-16 text based on codepoint).
[...]

Well, only for UTF-16BE. Given that UTF-16 is mostly used only
on Microsoft and as little endian, that would rarely be useful.

Still, it would be more consistent to have 0 < 1 < ... < 255
and would make sure the order is deterministic which is generally
expected of the C locale. It should be only a matter of calling
memcmp() when we detect the locale (LC_COLLATE) to be C or
POSIX.

GNU sort seems to be treating the NUL byte as the character that
sorts first and seems to be doing it using several calls to
strcoll() in non-C locales:

$ printf '%b\n' 'X\0\0A\0B' 'X\0\0A\0\0C' | ltrace -e strcoll sort
sort->strcoll("X", "X")        = 0
sort->strcoll("", "")          = 0
sort->strcoll("A", "A")        = 0
sort->strcoll("B", "")         = 1
XAC
XAB

(I'd say there's scope for optimisation there).

In the C locale, it just calls memcmp().

Cheers,
Stephane

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-05 11:54               ` Stephane Chazelas
  2017-06-05 19:15                 ` Stephane Chazelas
@ 2017-06-06  3:13                 ` Bart Schaefer
  2017-06-06  9:22                   ` Stephane Chazelas
  2017-06-07  8:41                 ` Stephane Chazelas
  2 siblings, 1 reply; 16+ messages in thread
From: Bart Schaefer @ 2017-06-06  3:13 UTC (permalink / raw)
  To: Zsh hackers list

On Jun 5, 12:54pm, Stephane Chazelas wrote:
}
} > Zsh has the additional complication of needing to deal with strings
} > having embedded '\0' bytes, which neither strcoll nor strxfrm is able
} > to deal with.  I'm not 100% confident that zsh's current algorithm
} > deals correct with this either.
} 
} From what I can see (by using ltrace -e strcoll), zsh removes
} the NUL bytes before calling strcoll, so $'\0\0x' sorts the same
} as $'\0x' or 'x'.

Like I said, I think it does this wrong.  If I'm reading the code
correctly, it first compares the strings for absolute identity while
searching for embedded nuls, and if they are identical up to the nul
it then orders the shorter string before the longer one; otherwise
it skips past the last nul and then relies on strcoll() for the rest
of both strings.  It would seem to me that the collation order should
be checked before any nul as well as after, otherwise the first loop
might conclude the strings differ when strcoll() would order them the
same.  (However, read below.)

} You mean a Schwartzian transform

Yes, much like that.  Src/sort.c already has a SortElt structure that
is used to sort metafied strings by comparing their unmetafied forms.
We only [*] need to add strxfrm() of the unmetafied strings in front
and remove strcoll() of transformed strings at comparison, and then
we're in business.  For example, following strxfrm() the assumptions
about absolute identity for nul handling suddenly become valid, so we
don't have to fix that separately -- we just have to strxfrm() all
the nul-separated substrings.

} With a comparison function that does memcmp() on the "string"
} parts and a number comparison on the "num" parts?

Equivlent to that, yes.  (I don't think zero-padding will work as we
don't know how many zeroes are needed to make the strings be the same
number of digits.)

} > For globbing, we'd have to rely on something else such as
} > whether MULTIBYTE is set.
} 
} Note that for globbing, the "numeric" sort applies after the
} "o+myfunc" or "oe:...:" transformation, so the strings to sort
} on may still contain all sorts of things

Whether there have been other globbing transforms turns out not to
matter.  The point about MULTIBYTE is that we have no glob flag we
can push around to indicate that the shell should assume there are
[not] wide characters in the comparions strings.

[*] That "only" is deceptive; it's actually a fairly hefty ask, for
reasons such as needing to handle case-insensitive comparisons too
(currently everything is forced through towlower() in that event).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-06  3:13                 ` Bart Schaefer
@ 2017-06-06  9:22                   ` Stephane Chazelas
  0 siblings, 0 replies; 16+ messages in thread
From: Stephane Chazelas @ 2017-06-06  9:22 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2017-06-05 20:13:54 -0700, Bart Schaefer:
[...]
> Like I said, I think it does this wrong.  If I'm reading the code
> correctly, it first compares the strings for absolute identity while
> searching for embedded nuls, and if they are identical up to the nul
> it then orders the shorter string before the longer one; otherwise
> it skips past the last nul and then relies on strcoll() for the rest
> of both strings.  It would seem to me that the collation order should
> be checked before any nul as well as after, otherwise the first loop
> might conclude the strings differ when strcoll() would order them the
> same.  (However, read below.)

I see, like in:

$ print -lo $'\u2461\0d' $'\u2463\0c' $'\u2460\0b' $'\u2462\0a' | tr -cd 'abcd\n'
d
c
b
a


Even though \u2460 .. \u2462 all sort the same in my locale, so
the order should be:

a
b
c
d

[...]
> (I don't think zero-padding will work as we
> don't know how many zeroes are needed to make the strings be the same
> number of digits.)

Yes, like I said, that would mean an extra scan of the whole
list to find the widest number.

Or, since most of the rest of zsh can't cope with decimal
integer numbers that are more than 19 digits, pad to 19 digits
(at the expense of memory and unnecessary byte comparisons when
it comes to comparing those large numbers of zeros), like in my
n() sorting function for *(o+n) as a replacement of *(n):
n() REPLY=${REPLY//(#m)<->/${(l:20::0:)MATCH}}

-- 
Stephane


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-03 21:16       ` Stephane Chazelas
  2017-06-04  0:07         ` Bart Schaefer
@ 2017-06-06 14:44         ` Vincent Lefevre
  2017-06-06 16:47           ` Stephane Chazelas
  1 sibling, 1 reply; 16+ messages in thread
From: Vincent Lefevre @ 2017-06-06 14:44 UTC (permalink / raw)
  To: zsh-workers; +Cc: Bart Schaefer

On 2017-06-03 22:16:46 +0100, Stephane Chazelas wrote:
> 2017-06-02 16:19:05 -0700, Bart Schaefer:
> > Well, one could argue that "-10" should be treated as negative ten
> > and therefore should sort before negative three, but I'm not sure
> > we want to get into that.
> 
> The (my at least) main usage for *(n) is to sort version numbers
> like zsh-3.0, zsh-3.1, zsh-4. So handling negative numbers
> wouldn't help in those cases.

I often uses "-" as a separator, so that I expect "foo-1" to come
before "foo-2".

And note that in Unicode, the real minus sign is U+2212.

> When comparing "zsh-3" with "zsh2", we compare the non-numeric
> prefix: "zsh-" and "zsh". And already, at that point, "zsh" is
> less than "zsh-", so we stop here (zsh2 < zsh-3)

I don't think this is the correct method. In some locales, digits
come after "-". So, IMHO, "zsh-0" should be compared with "zsh0".

I expect numeric sort and the normal sort be equivalent when all
numbers have a single digit. Numeric sort is just a generalization
to an infinite digit set (a number being regarded as an element
of this digit set).

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-06 14:44         ` Vincent Lefevre
@ 2017-06-06 16:47           ` Stephane Chazelas
  0 siblings, 0 replies; 16+ messages in thread
From: Stephane Chazelas @ 2017-06-06 16:47 UTC (permalink / raw)
  To: zsh-workers, Bart Schaefer

2017-06-06 16:44:59 +0200, Vincent Lefevre:
> On 2017-06-03 22:16:46 +0100, Stephane Chazelas wrote:
> > 2017-06-02 16:19:05 -0700, Bart Schaefer:
> > > Well, one could argue that "-10" should be treated as negative ten
> > > and therefore should sort before negative three, but I'm not sure
> > > we want to get into that.
> > 
> > The (my at least) main usage for *(n) is to sort version numbers
> > like zsh-3.0, zsh-3.1, zsh-4. So handling negative numbers
> > wouldn't help in those cases.
> 
> I often uses "-" as a separator, so that I expect "foo-1" to come
> before "foo-2".
> 
> And note that in Unicode, the real minus sign is U+2212.

Maybe so, but in every programming language, hyphen-minus is
used instead for negation. In any case, I don't think Bart (who
brought it up in the first place) was suggesting we should
change the behaviour in that regard.

> > When comparing "zsh-3" with "zsh2", we compare the non-numeric
> > prefix: "zsh-" and "zsh". And already, at that point, "zsh" is
> > less than "zsh-", so we stop here (zsh2 < zsh-3)
> 
> I don't think this is the correct method. In some locales, digits
> come after "-". So, IMHO, "zsh-0" should be compared with "zsh0".
> 
> I expect numeric sort and the normal sort be equivalent when all
> numbers have a single digit. Numeric sort is just a generalization
> to an infinite digit set (a number being regarded as an element
> of this digit set).
[...]

Of the approaches mentioned so far, only the strcoll() (not
strxfrm()) ones with 0-padding of numbers would satisfy that. 

$ echo *
zsh0 zsh-0 zsh1 zsh-1 zsh10 zsh-10 zsh2 zsh-2
$ echo *(n)
zsh0 zsh-0 zsh1 zsh-1 zsh2 zsh10 zsh-2 zsh-10
$ n() REPLY=${REPLY//(#m)<->/${(l:20::0:)MATCH}}
$ echo *(o+n)
zsh0 zsh-0 zsh1 zsh-1 zsh2 zsh-2 zsh10 zsh-10

The approaches that split into non-numerical and numerical parts
(even those that concatenate the result of several strxfrm()
with 0-padded numbers) would sort the above as

zsh0 zsh1 zsh2 zsh10 zsh-0 zsh-1 zsh-2 zsh-10

(personally, I prefer that order, but you might argue that I
should change the locale to C if I expect that order I suppose.
Do you have a real-life example where the other order may be
preferable?).

That 0-padding could still potentially be invalid if there were
collating elements that end in decimal digits in the locale.

Note that there are charsets like GB18030 that have characters
whose encoding ends in the 0x30 byte (the encoding of 0)

Â (\uc2) (and several thousand others) is one of them:

$ LC_ALL=zh_CN.gb18030 zsh -c 'printf "\uc2"' | hd
00000000  81 30 87 30                                       |.0.0|
00000004

My

n() REPLY=${REPLY//(#m)<->/${(l:20::0:)MATCH}}

zsh function would not replace those 0s that are not 0s because
the 0x30 are part of other characters, but the equivalent done
in C in zsh would have to take that into account I suppose (no
walking the byte values to find ASCII decimal digits)

-- 
Stephane


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-05 11:54               ` Stephane Chazelas
  2017-06-05 19:15                 ` Stephane Chazelas
  2017-06-06  3:13                 ` Bart Schaefer
@ 2017-06-07  8:41                 ` Stephane Chazelas
  2017-06-17 18:11                   ` Bart Schaefer
  2 siblings, 1 reply; 16+ messages in thread
From: Stephane Chazelas @ 2017-06-07  8:41 UTC (permalink / raw)
  To: Bart Schaefer, Zsh hackers list

2017-06-05 12:54:39 +0100, Stephane Chazelas:
[...]
> You mean a Schwartzian transform
> (https://en.wikipedia.org/wiki/Schwartzian_transform) that sorts
> on (here using $'\0zsh\0-1.2\0x' as an example):
> 
> {
>   original <= $'\0zsh\0-1.2\0x'
>   parts <= [{string: $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-")},
>             {num: 1},
> 	    {string: strxfrm(".")},
> 	    {num: 2},
> 	    {string: $'\0' . strxfrm("x")}]
> }, ...
> 
> With a comparison function that does memcmp() on the "string"
> parts and a number comparison on the "num" parts?
> 
> Yes, that sounds like a good idea that trades memory for
> efficiency (by doing the strxfrm() hard work up front and only
> once), but if we're going to transform upfront anyway, the
> memory is traded off already.
> 
> Or same using zero-padding
> 
> {
>   original <= $'\0zsh\0-1.2\0x'
>   transformed <= $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-") .
>                  "001" .  strxfrm(".") . "002" . $'\0' . strxfrm("x")}]
> }, ...
[...]

That would not be valid I (now) believe.

The transformation done by strxfrm() is opaque. For all we now,
it could very well return:

  strxfrm("foo")    -> "455"
  strxfrm("bar")    -> "217"
  strxfrm("foobar") -> "45517"

So comparing "foo1bar" "foo6bar" "foobar" with our function
(with adapted width padding) would be memcmp()ing "4551217",
"4556217" and "45517" resulting in potentially surprising
orders.

Even only concatenating the result of several strxfrm() is
probably not valid an approach.

It's probably OK if they're interspersed with NUL bytes as that
byte can't be found in the result of strxfrm() (and sorts before
anything else for memcmp()), but even then, when doing "human
collation" sorting, it probably makes more sense to consider
them as padding (like space) than the thing that must sort
before anything else, so removing them (or replacing them with
spaces to have a deterministic order) instead of inserting them
as \0 in between several strxfrm()s is just as valid and maybe
even preferable.

I think that if I had to implement it, I'd take the lazy
approach of:
 - in the C locale, use memcmp() with sequences of digits
   replaced with their 20-width 0-padding.
 - in other locales, strip the NULs and use strcoll() with
   sequences of digits replaced with their 20-width 0-padding
   (or memcmp() of the stxfrm() of the above, but with that 20x
   factor for the padding combined with a typical 5x factor for
   strxfrm(), the memory usage penalty may be too high).

Or the first approach mentioned above even if Vincent wouldn't
like it.

-- 
Stephane

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Surprising behaviour with numeric glob sort
  2017-06-07  8:41                 ` Stephane Chazelas
@ 2017-06-17 18:11                   ` Bart Schaefer
  0 siblings, 0 replies; 16+ messages in thread
From: Bart Schaefer @ 2017-06-17 18:11 UTC (permalink / raw)
  To: Zsh hackers list

On Jun 7,  9:41am, Stephane Chazelas wrote:
}
} I think that if I had to implement it, I'd take the lazy
} approach of:
}  - in the C locale, use memcmp() with sequences of digits
}    replaced with their 20-width 0-padding.
}  - in other locales, strip the NULs and use strcoll() with
}    sequences of digits replaced with their 20-width 0-padding
}    (or memcmp() of the stxfrm() of the above, but with that 20x
}    factor for the padding combined with a typical 5x factor for
}    strxfrm(), the memory usage penalty may be too high).

I looked at this for a while and decided that I have neither the free
time nor the correct runtime environment to tackle it.

By my reading of the code, the following steps are necessary:

A. Update configure.ac --
   1. check for availability of wcstrxfrm() or if not then strxfrm()
   2. check for wcstrcoll() or strcoll()
B. In sort.c:strmetasort() --
   1. If *unmetalenp is NULL, unmetafy all the strings and generate
      the corresponding lengths array
   2. Starting with the unmetafied strings, step through them wide-
      character-wise to
      a. find any null bytes
      b. find any numeric substrings
      c. peform the appropriate transformation, e.g. expansion to
         20 digits with leading zeros; this regenerates the *cmp and
         len values in each SortElt
   3. Apply any available strxfrm variant to every *cmp in SortElt
      array, updating len as needed
C. Update sort.c:eltpcmp() such that --
   1. In C locale or if there is a strxfrm or if there is no strcoll,
      apply strcmp(as->cmp, bs->cmp)
   2. Else apply available variant of strcoll(as->cmp, bs->cmp)
D. Similarly adapt zstrcmp(), which will have the side-effect of
   making the cases where it is applied a lot more computationally
   expensive because it can't precalculate like strmetasort().
E. Do all of the above while preserving various optimizations such as
   skipping step (B.2.b) when not numerically sorting, skipping (B.3)
   and using (C.1) when the original has no wide/metafied characters,
   etc.

And of course we need to decide whether the inconsistency noted by
Stephane that started this discussion is actually significant enough
to undertake this effort and eat the side-effect noted in (D) and
the memory consumption noted by Stephane.

We might want to do (A.2) even if nothing else.

Technically when we don't already have the strings unmetafied by the
caller we could do (B.1) and (B.2) simultaneously, but that adds
complexity to the code in the (B.2) loop.

Volunteers welcome, I'm dropping this as of now.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-06-17 18:11 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-31 21:24 Surprising behaviour with numeric glob sort Stephane Chazelas
2017-06-01 22:29 ` Bart Schaefer
2017-06-02  9:03   ` Stephane Chazelas
2017-06-02 23:19     ` Bart Schaefer
2017-06-03 21:16       ` Stephane Chazelas
2017-06-04  0:07         ` Bart Schaefer
2017-06-04 17:31           ` Stephane Chazelas
2017-06-04 22:01             ` Bart Schaefer
2017-06-05 11:54               ` Stephane Chazelas
2017-06-05 19:15                 ` Stephane Chazelas
2017-06-06  3:13                 ` Bart Schaefer
2017-06-06  9:22                   ` Stephane Chazelas
2017-06-07  8:41                 ` Stephane Chazelas
2017-06-17 18:11                   ` Bart Schaefer
2017-06-06 14:44         ` Vincent Lefevre
2017-06-06 16:47           ` Stephane Chazelas

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).