* Surprising behaviour with numeric glob sort @ 2017-05-31 21:24 Stephane Chazelas 2017-06-01 22:29 ` Bart Schaefer 0 siblings, 1 reply; 16+ messages in thread From: Stephane Chazelas @ 2017-05-31 21:24 UTC (permalink / raw) To: Zsh hackers list Some odd behaviour: $ echo *(n) a3 a10 a-2 $ rm a10 $ echo *(n) a-2 a3 (order of a-2 and a3 reversed after a10 is removed) $ echo $LANG en_GB.UTF-8 $ /lib/x86_64-linux-gnu/libc.so.6 GNU C Library (Ubuntu GLIBC 2.23-0ubuntu7) stable release version 2.23, by Roland McGrath et al. I think the problem is that here, zsh uses a non-total order. We have: a3 < a10 (same prefix, so numeric comparison of 3 < 10) a10 < a-2 (as per strcoll(). - ignored in first pass. No same prefix, so no numeric comparison) a-2 < a3 (as per strcoll(). - ignored in first pass. No same prefix, so no numeric comparison) We've got a circle here. AFAIK, the behaviour is unspecified if qsort() is called with a comparison function that doesn't implement a total order, so the result could be random. Once we remove a10, we're OK. GNU sort -V and ls -v seem to "handle" the issue by giving up on strcoll() which is not a lot better: $ ls a0 á0 a10 a-2 a3 b0 $ ls -v a0 a3 a10 a-2 b0 á0 $ ls | sort a0 á0 a10 a-2 a3 b0 $ ls | sort -V a0 a3 a10 a-2 b0 á0 Maybe a better approach would be to break down the strings between non-numeric and numeric parts and use strcoll() on the non-numeric and number comparison on the numeric parts, stopping at the first difference. a3 < a-2 because a < a- as per strcoll (even though a3 > 1-2) -- Stephane ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-05-31 21:24 Surprising behaviour with numeric glob sort Stephane Chazelas @ 2017-06-01 22:29 ` Bart Schaefer 2017-06-02 9:03 ` Stephane Chazelas 0 siblings, 1 reply; 16+ messages in thread From: Bart Schaefer @ 2017-06-01 22:29 UTC (permalink / raw) To: Zsh hackers list On May 31, 10:24pm, Stephane Chazelas wrote: } } Maybe a better approach would be to break down the strings } between non-numeric and numeric parts and use strcoll() on the } non-numeric and number comparison on the numeric parts, stopping } at the first difference. I don't think that helps, in the general case. It would still mean the sort is not stable where the numeric parts are the same but the non-numeric part is partially-ordered. To stabilize the sort we'd have to, for example, replace strcoll() with something that falls back to byte value ordering whenever the collation order of two characters is equivalent, but that requires lookahead (doesn't work on prefixes). ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-01 22:29 ` Bart Schaefer @ 2017-06-02 9:03 ` Stephane Chazelas 2017-06-02 23:19 ` Bart Schaefer 0 siblings, 1 reply; 16+ messages in thread From: Stephane Chazelas @ 2017-06-02 9:03 UTC (permalink / raw) To: Bart Schaefer; +Cc: Zsh hackers list 2017-06-01 15:29:43 -0700, Bart Schaefer: > On May 31, 10:24pm, Stephane Chazelas wrote: > } > } Maybe a better approach would be to break down the strings > } between non-numeric and numeric parts and use strcoll() on the > } non-numeric and number comparison on the numeric parts, stopping > } at the first difference. > > I don't think that helps, in the general case. It would still mean > the sort is not stable where the numeric parts are the same but the > non-numeric part is partially-ordered. > > To stabilize the sort we'd have to, for example, replace strcoll() > with something that falls back to byte value ordering whenever the > collation order of two characters is equivalent, but that requires > lookahead (doesn't work on prefixes). [...] Sorry, my choice of words was poor. I shouldn't have used "total" there. OK, in a locale where A, B and C sort the same, globbing is non-deterministic (with or without numericglobsort, with the current situation or with the change I propose) but is possible. But with the comparison algorithm of zsh's current *(n) that for some values of A,B,C, have A < B, B < C, C < A, sorting is just not possible. Some qsort() will give one result (that doesn't satisfy all those), some have been known to SEGV, some might loop indefinitely. But more importantly, it gives unexpected results in real-life cases. In a locale where A, B and C sort the same, with numericglobsort, A2 B10 C1 should sort as C1 A2 B10, just like without numericglobsort it should (and does) sort as C1 B10 A2¹ or print -l A B C | sort -u would give one line (A here because of the last-resort memcmp() comparison). I have no problem with that. That's the intention of the collation algorithm (though I argue those locales are broken, locale collation algorithms, at least the system ones should have a total order, that was more or less the conclusion of a related discussion at the opengroup). But: $ echo *(n) zsh-10 zsh2 zsh10 zsh-3 (here in my en_GB.UTF-8 GNU locale) is unexpected/broken. "zsh" sorts before "zsh-" in my locale, so I'd expect the zsh2, zsh10 to come before zsh-3, zsh-10 which is the basis of my proposal. In any case, zsh-3 should come before zsh-10, nobody can argue against that. In a locale where "zsh-" sorts the same as "zsh", *(n) currently gives either zsh2 zsh-3 zsh10 zsh-10 or zsh2 zsh-3 zsh10 zsh-10, both of which are fine with me. And it wouldn't change with my proposal. It would be nice to have a consistent order, for instance by implementing a last-resort memcmp()-based comparison like "sort" does without -s, but that's nowhere as important a problem as in my experience, real life file names don't have parts that sort the same in any locale (and only GNU systems in my experience have locales with such non-total orders, for the most part non-intentionally like the ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ that sort the same in GNU locales). ¹ Note that the fact that x-A xB x-C sorts in that order in GNU non-C locales is not because "x" sorts the same as "x-" but because the primary weight of a "-" is "IGNORE" so when comparing x-A and xB, strcoll() first compares "xA" with "xB". If it was xA against x-A, then the other weights would be considered which would sort "xA" before "x-A" -- Stephane ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-02 9:03 ` Stephane Chazelas @ 2017-06-02 23:19 ` Bart Schaefer 2017-06-03 21:16 ` Stephane Chazelas 0 siblings, 1 reply; 16+ messages in thread From: Bart Schaefer @ 2017-06-02 23:19 UTC (permalink / raw) To: Zsh hackers list On Jun 2, 10:03am, Stephane Chazelas wrote: } } $ echo *(n) } zsh-10 zsh2 zsh10 zsh-3 } } (here in my en_GB.UTF-8 GNU locale) } } is unexpected/broken. "zsh" sorts before "zsh-" in my locale, so } I'd expect the zsh2, zsh10 to come before zsh-3, zsh-10 which is } the basis of my proposal. In any case, zsh-3 should come before } zsh-10, nobody can argue against that. Well, one could argue that "-10" should be treated as negative ten and therefore should sort before negative three, but I'm not sure we want to get into that. The reason you get the result above is of course because most sort algorithms assume there is a total order and therefore that it is not necessary to compare every possible pairing of elements. Your proposal was > } break down the strings > } between non-numeric and numeric parts and use strcoll() on the > } non-numeric and number comparison on the numeric parts As far as I can tell that's exactly zstrbcmp in zle_tricky.c does. zstrcmp in sort.c on the other hand first attempts strcoll and only compares numeric parts if it can find corresponding numeric substrings in both input strings. That is, "zsh-3" is never compared numerically to "zsh2" because "zsh2" and "zsh-" are considered already to differ. In either case, if zstrcmp or zstrbcomp find a digit, they consume more digits until they hit a non-digit or two not-equal digits, and then look both backward and forward for digits to calculate the numeric value for comparison. So I think what you propose is that when "zsh1" is found to have a difference with "zsh-", the algorithm should look forward across "zsh-" to find "3" and at that point end up comparing "10" to "3"? That would lead to the order in your example becoming zsh2 zsh-3 zsh10 zsh-10. However, that would also mean that in strings with different sets of numeric substrings the numeric comparisons might be be "detected" after different prefixes for different pairs of strings; I think the result there might be even more confusing, but I can't come up with a specific example. It also means having to copy non-numeric substrings during every comparison, so as to be able to call strcoll without modifying the input strings. (What's the alternative?) This would probably make sorting prohibitively slow. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-02 23:19 ` Bart Schaefer @ 2017-06-03 21:16 ` Stephane Chazelas 2017-06-04 0:07 ` Bart Schaefer 2017-06-06 14:44 ` Vincent Lefevre 0 siblings, 2 replies; 16+ messages in thread From: Stephane Chazelas @ 2017-06-03 21:16 UTC (permalink / raw) To: Bart Schaefer; +Cc: Zsh hackers list 2017-06-02 16:19:05 -0700, Bart Schaefer: > On Jun 2, 10:03am, Stephane Chazelas wrote: > } > } $ echo *(n) > } zsh-10 zsh2 zsh10 zsh-3 > } > } (here in my en_GB.UTF-8 GNU locale) > } > } is unexpected/broken. "zsh" sorts before "zsh-" in my locale, so > } I'd expect the zsh2, zsh10 to come before zsh-3, zsh-10 which is > } the basis of my proposal. In any case, zsh-3 should come before > } zsh-10, nobody can argue against that. > > Well, one could argue that "-10" should be treated as negative ten > and therefore should sort before negative three, but I'm not sure > we want to get into that. The (my at least) main usage for *(n) is to sort version numbers like zsh-3.0, zsh-3.1, zsh-4. So handling negative numbers wouldn't help in those cases. [...] > That is, "zsh-3" is never > compared numerically to "zsh2" because "zsh2" and "zsh-" are > considered already to differ. [...] > So I think what you propose is that when "zsh1" is found to have a > difference with "zsh-", the algorithm should look forward across > "zsh-" to find "3" and at that point end up comparing "10" to "3"? > That would lead to the order in your example becoming > zsh2 zsh-3 zsh10 zsh-10. [...] No, what I propose is very simple. When comparing "zsh-3" with "zsh2", we compare the non-numeric prefix: "zsh-" and "zsh". And already, at that point, "zsh" is less than "zsh-", so we stop here (zsh2 < zsh-3) If it was zsh-3.1 vs zsh-3 ["zsh-", 3, ".", 1] vs ["zsh-", 3] - strcoll(zsh-, zsh-) => 0 - 3 == 3 - strcoll(".", "") => zsh-3 < zsh-3.1 Now there are some aspects of the current implementation that one might find useful like: $ echo * a a-3.1 a-3+1 a-3.2 a-3+2 $ (LC_ALL=C; echo *) a a-3+1 a-3+2 a-3.1 a-3.2 $ echo *(n) a a-3.1 a-3+1 a-3.2 a-3+2 $ (LC_ALL=C; echo *(n)) a a-3+1 a-3+2 a-3.1 a-3.2 The fact that those "-" and "." are ignored in the first strcoll() pass in some locales makes it for a more "numerical" sort. Though again, it's easily broken with: $ touch a-3.10 $ echo *(n) a a-3.1 a-3+1 a-3.2 a-3.10 a-3+2 Ideally, we'd want to hook into the strcoll() algorithm to introduce the numerical comparisons in there. Maybe that can be done using zero-padding like for the above, just do a strcoll() comparison after transformation (a sort of pre-strxfrm()) of the strings from: a a-3.1 a-3+1 a-3.2 a-3.10 a-3+2 to: a a-03.01 a-03.01 a-03+01 a-03.02 a-03.10 a-03+02 adjusting the length of the padding as needed. The above would sort to a a-03.01 a-03.01 a-03+01 a-03.02 a-03+02 a-03.10 In my GNU British locale and a a-03+01 a-03+02 a-03.01 a-03.01 a-03.02 a-03.10 In the C locale. -- Stephane ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-03 21:16 ` Stephane Chazelas @ 2017-06-04 0:07 ` Bart Schaefer 2017-06-04 17:31 ` Stephane Chazelas 2017-06-06 14:44 ` Vincent Lefevre 1 sibling, 1 reply; 16+ messages in thread From: Bart Schaefer @ 2017-06-04 0:07 UTC (permalink / raw) To: Zsh hackers list On Jun 3, 10:16pm, Stephane Chazelas wrote: } } When comparing "zsh-3" with "zsh2", we compare the non-numeric } prefix: "zsh-" and "zsh". And already, at that point, "zsh" is } less than "zsh-", so we stop here (zsh2 < zsh-3) Explain how to do that without either re-implementing strcoll(), copying the input strings into temporaries, or requiring the input strings to be writable so that we can plug in temporary '\0' bytes after each non-numeric substring, and at that point we can discuss using this algorithm. Otherwise it's going to be unusuably slow for any moderately large input set. Otherwise you're describing what's already done: strcoll() is used, and then the non-numeric prefixes of the two strings are compared, and only if the non-numeric parts are identical is numeric comparison applied. (Of course this is already slightly wrong, because it means the non-numeric parts have to be *identical* not merely collated the same, but to fix that we're back to "re-implement strcoll()".) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-04 0:07 ` Bart Schaefer @ 2017-06-04 17:31 ` Stephane Chazelas 2017-06-04 22:01 ` Bart Schaefer 0 siblings, 1 reply; 16+ messages in thread From: Stephane Chazelas @ 2017-06-04 17:31 UTC (permalink / raw) To: Bart Schaefer; +Cc: Zsh hackers list 2017-06-03 17:07:24 -0700, Bart Schaefer: > On Jun 3, 10:16pm, Stephane Chazelas wrote: > } > } When comparing "zsh-3" with "zsh2", we compare the non-numeric > } prefix: "zsh-" and "zsh". And already, at that point, "zsh" is > } less than "zsh-", so we stop here (zsh2 < zsh-3) > > Explain how to do that without either re-implementing strcoll(), > copying the input strings into temporaries, or requiring the input > strings to be writable so that we can plug in temporary '\0' bytes > after each non-numeric substring, and at that point we can discuss > using this algorithm. Otherwise it's going to be unusuably slow > for any moderately large input set. "Slow" (though probably quite negligible compared to strcoll() which already does things like in addition to a lot more hard wark) but working. Copying into temporaries (for instance allocated on stack as twice the size of the input was what I had in mind), either in the comparing function, or prepare the list before by linking each string to its prepared form (though some of that "preparation" may not be needed in the end). I quite like the zero-padding approach myself, though if we want to allow numbers of any width, we'd need to do two scans of the list, once to find the widest number and a second time to pad (and memory allocation is more complicated). > Otherwise you're describing what's already done: strcoll() is used, > and then the non-numeric prefixes of the two strings are compared, > and only if the non-numeric parts are identical is numeric comparison > applied. (Of course this is already slightly wrong, because it means > the non-numeric parts have to be *identical* not merely collated the > same, but to fix that we're back to "re-implement strcoll()".) I don't see how that's more "re-implement strcoll" than what zsh already does. What do you propose? $ print -rl -- *(n) zsh+10 zsh2 zsh10 zsh+3 is broken (and not slightly IMO) and needs fixed. We can already fix that numerical sort by hand with $ n() REPLY=${REPLY//(#m)<->/${(l:20::0:)MATCH}} $ print -rl -- *(o+n) zsh2 zsh+3 zsh10 zsh+10 $ (LC_ALL=C; print -rl -- *(o+n)) zsh+3 zsh+10 zsh2 zsh10 But that's obviously less "efficient" than if zsh did it internally. -- Stephane ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-04 17:31 ` Stephane Chazelas @ 2017-06-04 22:01 ` Bart Schaefer 2017-06-05 11:54 ` Stephane Chazelas 0 siblings, 1 reply; 16+ messages in thread From: Bart Schaefer @ 2017-06-04 22:01 UTC (permalink / raw) To: Zsh hackers list On Jun 4, 6:31pm, Stephane Chazelas wrote: } } "Slow" (though probably quite negligible compared to strcoll() } which already does things like in addition to a lot more hard } wark) but working. According to one strcoll man page I have, the right thing to do is convert all the strings with strxfrm and then strcmp those. It provides no advice on how to order the original array to match the sorted result of the xfrm'd array (the transform is not reversible), nor how to determine how much space to allocate for each transform. Zsh has the additional complication of needing to deal with strings having embedded '\0' bytes, which neither strcoll nor strxfrm is able to deal with. I'm not 100% confident that zsh's current algorithm deals correct with this either. A possible approach would be to pre-allocate a hash table and fill it on the fly with keys the original strings and values the strxfrm results. An additional strxfrm could be called on the trailing bits after any embedded nul. Then sort the original array by comparing the values in the hash. Doesn't solve the question of how much space to reserve, but it would allow the current algorithm for picking out numeric substrings to be used. For parameter array sorting we could extend the (m) flag to indicate that this transformation is required [it already means to count byte widths for (l), (r), etc.] so as to avoid the overhead when it is not needed. For globbing, we'd have to rely on something else such as whether MULTIBYTE is set. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-04 22:01 ` Bart Schaefer @ 2017-06-05 11:54 ` Stephane Chazelas 2017-06-05 19:15 ` Stephane Chazelas ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Stephane Chazelas @ 2017-06-05 11:54 UTC (permalink / raw) To: Bart Schaefer; +Cc: Zsh hackers list 2017-06-04 15:01:35 -0700, Bart Schaefer: > On Jun 4, 6:31pm, Stephane Chazelas wrote: > } > } "Slow" (though probably quite negligible compared to strcoll() > } which already does things like in addition to a lot more hard > } wark) but working. > > According to one strcoll man page I have, the right thing to do is > convert all the strings with strxfrm and then strcmp those. > > It provides no advice on how to order the original array to match the > sorted result of the xfrm'd array (the transform is not reversible), > nor how to determine how much space to allocate for each transform. > > Zsh has the additional complication of needing to deal with strings > having embedded '\0' bytes, which neither strcoll nor strxfrm is able > to deal with. I'm not 100% confident that zsh's current algorithm > deals correct with this either. >From what I can see (by using ltrace -e strcoll), zsh removes the NUL bytes before calling strcoll, so $'\0\0x' sorts the same as $'\0x' or 'x'. That's as if $'\0' had IGNORE for all weights in the collation order which I suppose is as valid as any. Per POSIX, such input would not be text, so "sorting" them may not be very meaningful. Same for strings that contain byte sequences that don't form valid characters where there's no saying what strcoll() may do with them. Having NUL sort before any other character would be preferable in the C locale that sorts by code point though (like to sort UTF-16 text based on codepoint). > A possible approach would be to pre-allocate a hash table and fill > it on the fly with keys the original strings and values the strxfrm > results. An additional strxfrm could be called on the trailing bits > after any embedded nul. Then sort the original array by comparing > the values in the hash. Doesn't solve the question of how much space > to reserve, but it would allow the current algorithm for picking out > numeric substrings to be used. You mean a Schwartzian transform (https://en.wikipedia.org/wiki/Schwartzian_transform) that sorts on (here using $'\0zsh\0-1.2\0x' as an example): { original <= $'\0zsh\0-1.2\0x' parts <= [{string: $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-")}, {num: 1}, {string: strxfrm(".")}, {num: 2}, {string: $'\0' . strxfrm("x")}] }, ... With a comparison function that does memcmp() on the "string" parts and a number comparison on the "num" parts? Yes, that sounds like a good idea that trades memory for efficiency (by doing the strxfrm() hard work up front and only once), but if we're going to transform upfront anyway, the memory is traded off already. Or same using zero-padding { original <= $'\0zsh\0-1.2\0x' transformed <= $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-") . "001" . strxfrm(".") . "002" . $'\0' . strxfrm("x")}] }, ... (and using memcmp() in the comparison function) With same benefits (a-1 a+2 a-3 sorted in that order where "-" and "+" are ignored in the first pass like without numericglobsort) and same drawback (need to find out the right padding width). > For parameter array sorting we could extend the (m) flag to indicate > that this transformation is required [it already means to count byte > widths for (l), (r), etc.] so as to avoid the overhead when it is not > needed. For globbing, we'd have to rely on something else such as > whether MULTIBYTE is set. Note that for globbing, the "numeric" sort applies after the "o+myfunc" or "oe:...:" transformation, so the strings to sort on may still contain all sorts of things like NUL bytes and byte sequences that don't form valid characters even if they were not in the file names in the first place. Like in a by_content() REPLY=$mapfile[$REPLY] echo *(no+by_content) -- Stephane ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-05 11:54 ` Stephane Chazelas @ 2017-06-05 19:15 ` Stephane Chazelas 2017-06-06 3:13 ` Bart Schaefer 2017-06-07 8:41 ` Stephane Chazelas 2 siblings, 0 replies; 16+ messages in thread From: Stephane Chazelas @ 2017-06-05 19:15 UTC (permalink / raw) To: Bart Schaefer, Zsh hackers list 2017-06-05 12:54:39 +0100, Stephane Chazelas: [...] > Having NUL sort before any other character would be preferable > in the C locale that sorts by code point though (like to sort > UTF-16 text based on codepoint). [...] Well, only for UTF-16BE. Given that UTF-16 is mostly used only on Microsoft and as little endian, that would rarely be useful. Still, it would be more consistent to have 0 < 1 < ... < 255 and would make sure the order is deterministic which is generally expected of the C locale. It should be only a matter of calling memcmp() when we detect the locale (LC_COLLATE) to be C or POSIX. GNU sort seems to be treating the NUL byte as the character that sorts first and seems to be doing it using several calls to strcoll() in non-C locales: $ printf '%b\n' 'X\0\0A\0B' 'X\0\0A\0\0C' | ltrace -e strcoll sort sort->strcoll("X", "X") = 0 sort->strcoll("", "") = 0 sort->strcoll("A", "A") = 0 sort->strcoll("B", "") = 1 XAC XAB (I'd say there's scope for optimisation there). In the C locale, it just calls memcmp(). Cheers, Stephane ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-05 11:54 ` Stephane Chazelas 2017-06-05 19:15 ` Stephane Chazelas @ 2017-06-06 3:13 ` Bart Schaefer 2017-06-06 9:22 ` Stephane Chazelas 2017-06-07 8:41 ` Stephane Chazelas 2 siblings, 1 reply; 16+ messages in thread From: Bart Schaefer @ 2017-06-06 3:13 UTC (permalink / raw) To: Zsh hackers list On Jun 5, 12:54pm, Stephane Chazelas wrote: } } > Zsh has the additional complication of needing to deal with strings } > having embedded '\0' bytes, which neither strcoll nor strxfrm is able } > to deal with. I'm not 100% confident that zsh's current algorithm } > deals correct with this either. } } From what I can see (by using ltrace -e strcoll), zsh removes } the NUL bytes before calling strcoll, so $'\0\0x' sorts the same } as $'\0x' or 'x'. Like I said, I think it does this wrong. If I'm reading the code correctly, it first compares the strings for absolute identity while searching for embedded nuls, and if they are identical up to the nul it then orders the shorter string before the longer one; otherwise it skips past the last nul and then relies on strcoll() for the rest of both strings. It would seem to me that the collation order should be checked before any nul as well as after, otherwise the first loop might conclude the strings differ when strcoll() would order them the same. (However, read below.) } You mean a Schwartzian transform Yes, much like that. Src/sort.c already has a SortElt structure that is used to sort metafied strings by comparing their unmetafied forms. We only [*] need to add strxfrm() of the unmetafied strings in front and remove strcoll() of transformed strings at comparison, and then we're in business. For example, following strxfrm() the assumptions about absolute identity for nul handling suddenly become valid, so we don't have to fix that separately -- we just have to strxfrm() all the nul-separated substrings. } With a comparison function that does memcmp() on the "string" } parts and a number comparison on the "num" parts? Equivlent to that, yes. (I don't think zero-padding will work as we don't know how many zeroes are needed to make the strings be the same number of digits.) } > For globbing, we'd have to rely on something else such as } > whether MULTIBYTE is set. } } Note that for globbing, the "numeric" sort applies after the } "o+myfunc" or "oe:...:" transformation, so the strings to sort } on may still contain all sorts of things Whether there have been other globbing transforms turns out not to matter. The point about MULTIBYTE is that we have no glob flag we can push around to indicate that the shell should assume there are [not] wide characters in the comparions strings. [*] That "only" is deceptive; it's actually a fairly hefty ask, for reasons such as needing to handle case-insensitive comparisons too (currently everything is forced through towlower() in that event). ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-06 3:13 ` Bart Schaefer @ 2017-06-06 9:22 ` Stephane Chazelas 0 siblings, 0 replies; 16+ messages in thread From: Stephane Chazelas @ 2017-06-06 9:22 UTC (permalink / raw) To: Bart Schaefer; +Cc: Zsh hackers list 2017-06-05 20:13:54 -0700, Bart Schaefer: [...] > Like I said, I think it does this wrong. If I'm reading the code > correctly, it first compares the strings for absolute identity while > searching for embedded nuls, and if they are identical up to the nul > it then orders the shorter string before the longer one; otherwise > it skips past the last nul and then relies on strcoll() for the rest > of both strings. It would seem to me that the collation order should > be checked before any nul as well as after, otherwise the first loop > might conclude the strings differ when strcoll() would order them the > same. (However, read below.) I see, like in: $ print -lo $'\u2461\0d' $'\u2463\0c' $'\u2460\0b' $'\u2462\0a' | tr -cd 'abcd\n' d c b a Even though \u2460 .. \u2462 all sort the same in my locale, so the order should be: a b c d [...] > (I don't think zero-padding will work as we > don't know how many zeroes are needed to make the strings be the same > number of digits.) Yes, like I said, that would mean an extra scan of the whole list to find the widest number. Or, since most of the rest of zsh can't cope with decimal integer numbers that are more than 19 digits, pad to 19 digits (at the expense of memory and unnecessary byte comparisons when it comes to comparing those large numbers of zeros), like in my n() sorting function for *(o+n) as a replacement of *(n): n() REPLY=${REPLY//(#m)<->/${(l:20::0:)MATCH}} -- Stephane ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-05 11:54 ` Stephane Chazelas 2017-06-05 19:15 ` Stephane Chazelas 2017-06-06 3:13 ` Bart Schaefer @ 2017-06-07 8:41 ` Stephane Chazelas 2017-06-17 18:11 ` Bart Schaefer 2 siblings, 1 reply; 16+ messages in thread From: Stephane Chazelas @ 2017-06-07 8:41 UTC (permalink / raw) To: Bart Schaefer, Zsh hackers list 2017-06-05 12:54:39 +0100, Stephane Chazelas: [...] > You mean a Schwartzian transform > (https://en.wikipedia.org/wiki/Schwartzian_transform) that sorts > on (here using $'\0zsh\0-1.2\0x' as an example): > > { > original <= $'\0zsh\0-1.2\0x' > parts <= [{string: $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-")}, > {num: 1}, > {string: strxfrm(".")}, > {num: 2}, > {string: $'\0' . strxfrm("x")}] > }, ... > > With a comparison function that does memcmp() on the "string" > parts and a number comparison on the "num" parts? > > Yes, that sounds like a good idea that trades memory for > efficiency (by doing the strxfrm() hard work up front and only > once), but if we're going to transform upfront anyway, the > memory is traded off already. > > Or same using zero-padding > > { > original <= $'\0zsh\0-1.2\0x' > transformed <= $'\0' . strxfrm("zsh") . $'\0' . strxfrm("-") . > "001" . strxfrm(".") . "002" . $'\0' . strxfrm("x")}] > }, ... [...] That would not be valid I (now) believe. The transformation done by strxfrm() is opaque. For all we now, it could very well return: strxfrm("foo") -> "455" strxfrm("bar") -> "217" strxfrm("foobar") -> "45517" So comparing "foo1bar" "foo6bar" "foobar" with our function (with adapted width padding) would be memcmp()ing "4551217", "4556217" and "45517" resulting in potentially surprising orders. Even only concatenating the result of several strxfrm() is probably not valid an approach. It's probably OK if they're interspersed with NUL bytes as that byte can't be found in the result of strxfrm() (and sorts before anything else for memcmp()), but even then, when doing "human collation" sorting, it probably makes more sense to consider them as padding (like space) than the thing that must sort before anything else, so removing them (or replacing them with spaces to have a deterministic order) instead of inserting them as \0 in between several strxfrm()s is just as valid and maybe even preferable. I think that if I had to implement it, I'd take the lazy approach of: - in the C locale, use memcmp() with sequences of digits replaced with their 20-width 0-padding. - in other locales, strip the NULs and use strcoll() with sequences of digits replaced with their 20-width 0-padding (or memcmp() of the stxfrm() of the above, but with that 20x factor for the padding combined with a typical 5x factor for strxfrm(), the memory usage penalty may be too high). Or the first approach mentioned above even if Vincent wouldn't like it. -- Stephane ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-07 8:41 ` Stephane Chazelas @ 2017-06-17 18:11 ` Bart Schaefer 0 siblings, 0 replies; 16+ messages in thread From: Bart Schaefer @ 2017-06-17 18:11 UTC (permalink / raw) To: Zsh hackers list On Jun 7, 9:41am, Stephane Chazelas wrote: } } I think that if I had to implement it, I'd take the lazy } approach of: } - in the C locale, use memcmp() with sequences of digits } replaced with their 20-width 0-padding. } - in other locales, strip the NULs and use strcoll() with } sequences of digits replaced with their 20-width 0-padding } (or memcmp() of the stxfrm() of the above, but with that 20x } factor for the padding combined with a typical 5x factor for } strxfrm(), the memory usage penalty may be too high). I looked at this for a while and decided that I have neither the free time nor the correct runtime environment to tackle it. By my reading of the code, the following steps are necessary: A. Update configure.ac -- 1. check for availability of wcstrxfrm() or if not then strxfrm() 2. check for wcstrcoll() or strcoll() B. In sort.c:strmetasort() -- 1. If *unmetalenp is NULL, unmetafy all the strings and generate the corresponding lengths array 2. Starting with the unmetafied strings, step through them wide- character-wise to a. find any null bytes b. find any numeric substrings c. peform the appropriate transformation, e.g. expansion to 20 digits with leading zeros; this regenerates the *cmp and len values in each SortElt 3. Apply any available strxfrm variant to every *cmp in SortElt array, updating len as needed C. Update sort.c:eltpcmp() such that -- 1. In C locale or if there is a strxfrm or if there is no strcoll, apply strcmp(as->cmp, bs->cmp) 2. Else apply available variant of strcoll(as->cmp, bs->cmp) D. Similarly adapt zstrcmp(), which will have the side-effect of making the cases where it is applied a lot more computationally expensive because it can't precalculate like strmetasort(). E. Do all of the above while preserving various optimizations such as skipping step (B.2.b) when not numerically sorting, skipping (B.3) and using (C.1) when the original has no wide/metafied characters, etc. And of course we need to decide whether the inconsistency noted by Stephane that started this discussion is actually significant enough to undertake this effort and eat the side-effect noted in (D) and the memory consumption noted by Stephane. We might want to do (A.2) even if nothing else. Technically when we don't already have the strings unmetafied by the caller we could do (B.1) and (B.2) simultaneously, but that adds complexity to the code in the (B.2) loop. Volunteers welcome, I'm dropping this as of now. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-03 21:16 ` Stephane Chazelas 2017-06-04 0:07 ` Bart Schaefer @ 2017-06-06 14:44 ` Vincent Lefevre 2017-06-06 16:47 ` Stephane Chazelas 1 sibling, 1 reply; 16+ messages in thread From: Vincent Lefevre @ 2017-06-06 14:44 UTC (permalink / raw) To: zsh-workers; +Cc: Bart Schaefer On 2017-06-03 22:16:46 +0100, Stephane Chazelas wrote: > 2017-06-02 16:19:05 -0700, Bart Schaefer: > > Well, one could argue that "-10" should be treated as negative ten > > and therefore should sort before negative three, but I'm not sure > > we want to get into that. > > The (my at least) main usage for *(n) is to sort version numbers > like zsh-3.0, zsh-3.1, zsh-4. So handling negative numbers > wouldn't help in those cases. I often uses "-" as a separator, so that I expect "foo-1" to come before "foo-2". And note that in Unicode, the real minus sign is U+2212. > When comparing "zsh-3" with "zsh2", we compare the non-numeric > prefix: "zsh-" and "zsh". And already, at that point, "zsh" is > less than "zsh-", so we stop here (zsh2 < zsh-3) I don't think this is the correct method. In some locales, digits come after "-". So, IMHO, "zsh-0" should be compared with "zsh0". I expect numeric sort and the normal sort be equivalent when all numbers have a single digit. Numeric sort is just a generalization to an infinite digit set (a number being regarded as an element of this digit set). -- Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/> 100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/> Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Surprising behaviour with numeric glob sort 2017-06-06 14:44 ` Vincent Lefevre @ 2017-06-06 16:47 ` Stephane Chazelas 0 siblings, 0 replies; 16+ messages in thread From: Stephane Chazelas @ 2017-06-06 16:47 UTC (permalink / raw) To: zsh-workers, Bart Schaefer 2017-06-06 16:44:59 +0200, Vincent Lefevre: > On 2017-06-03 22:16:46 +0100, Stephane Chazelas wrote: > > 2017-06-02 16:19:05 -0700, Bart Schaefer: > > > Well, one could argue that "-10" should be treated as negative ten > > > and therefore should sort before negative three, but I'm not sure > > > we want to get into that. > > > > The (my at least) main usage for *(n) is to sort version numbers > > like zsh-3.0, zsh-3.1, zsh-4. So handling negative numbers > > wouldn't help in those cases. > > I often uses "-" as a separator, so that I expect "foo-1" to come > before "foo-2". > > And note that in Unicode, the real minus sign is U+2212. Maybe so, but in every programming language, hyphen-minus is used instead for negation. In any case, I don't think Bart (who brought it up in the first place) was suggesting we should change the behaviour in that regard. > > When comparing "zsh-3" with "zsh2", we compare the non-numeric > > prefix: "zsh-" and "zsh". And already, at that point, "zsh" is > > less than "zsh-", so we stop here (zsh2 < zsh-3) > > I don't think this is the correct method. In some locales, digits > come after "-". So, IMHO, "zsh-0" should be compared with "zsh0". > > I expect numeric sort and the normal sort be equivalent when all > numbers have a single digit. Numeric sort is just a generalization > to an infinite digit set (a number being regarded as an element > of this digit set). [...] Of the approaches mentioned so far, only the strcoll() (not strxfrm()) ones with 0-padding of numbers would satisfy that. $ echo * zsh0 zsh-0 zsh1 zsh-1 zsh10 zsh-10 zsh2 zsh-2 $ echo *(n) zsh0 zsh-0 zsh1 zsh-1 zsh2 zsh10 zsh-2 zsh-10 $ n() REPLY=${REPLY//(#m)<->/${(l:20::0:)MATCH}} $ echo *(o+n) zsh0 zsh-0 zsh1 zsh-1 zsh2 zsh-2 zsh10 zsh-10 The approaches that split into non-numerical and numerical parts (even those that concatenate the result of several strxfrm() with 0-padded numbers) would sort the above as zsh0 zsh1 zsh2 zsh10 zsh-0 zsh-1 zsh-2 zsh-10 (personally, I prefer that order, but you might argue that I should change the locale to C if I expect that order I suppose. Do you have a real-life example where the other order may be preferable?). That 0-padding could still potentially be invalid if there were collating elements that end in decimal digits in the locale. Note that there are charsets like GB18030 that have characters whose encoding ends in the 0x30 byte (the encoding of 0) Â (\uc2) (and several thousand others) is one of them: $ LC_ALL=zh_CN.gb18030 zsh -c 'printf "\uc2"' | hd 00000000 81 30 87 30 |.0.0| 00000004 My n() REPLY=${REPLY//(#m)<->/${(l:20::0:)MATCH}} zsh function would not replace those 0s that are not 0s because the 0x30 are part of other characters, but the equivalent done in C in zsh would have to take that into account I suppose (no walking the byte values to find ASCII decimal digits) -- Stephane ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2017-06-17 18:11 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-05-31 21:24 Surprising behaviour with numeric glob sort Stephane Chazelas 2017-06-01 22:29 ` Bart Schaefer 2017-06-02 9:03 ` Stephane Chazelas 2017-06-02 23:19 ` Bart Schaefer 2017-06-03 21:16 ` Stephane Chazelas 2017-06-04 0:07 ` Bart Schaefer 2017-06-04 17:31 ` Stephane Chazelas 2017-06-04 22:01 ` Bart Schaefer 2017-06-05 11:54 ` Stephane Chazelas 2017-06-05 19:15 ` Stephane Chazelas 2017-06-06 3:13 ` Bart Schaefer 2017-06-06 9:22 ` Stephane Chazelas 2017-06-07 8:41 ` Stephane Chazelas 2017-06-17 18:11 ` Bart Schaefer 2017-06-06 14:44 ` Vincent Lefevre 2017-06-06 16:47 ` Stephane Chazelas
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/zsh/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).