* Incorrect sorting of Polish characters @ 2016-07-16 17:17 Michał Bartoszkiewicz 2016-07-16 20:07 ` Bart Schaefer 0 siblings, 1 reply; 7+ messages in thread From: Michał Bartoszkiewicz @ 2016-07-16 17:17 UTC (permalink / raw) To: zsh-workers I have noticed that some Polish characters (at least 'ć' and 'ę' – U+0107 and U+0119) are sorted incorrectly in glob expansion (but correctly in other contexts). Example: $ export LC_ALL=pl_PL.UTF-8 $ names=(a b c d e f $'\u0105' $'\u0107' $'\u0119') $ print -o $names a ą b c ć d e ę f $ touch $names; print ? a ą ć ę b c d e f The ordering given by print -o is the correct one. I am using zsh 5.2, but the problem also appears in git revision 3e61af3. Thanks, -- Michał Bartoszkiewicz <mbartoszkiewicz@gmail.com> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Incorrect sorting of Polish characters 2016-07-16 17:17 Incorrect sorting of Polish characters Michał Bartoszkiewicz @ 2016-07-16 20:07 ` Bart Schaefer 2016-07-18 9:33 ` Peter Stephenson 0 siblings, 1 reply; 7+ messages in thread From: Bart Schaefer @ 2016-07-16 20:07 UTC (permalink / raw) To: zsh-workers On Jul 16, 7:17pm, M. Bartoszkiewicz wrote: } } I have noticed that some Polish characters } are sorted incorrectly in glob expansion (but } correctly in other contexts). Both parameter expansion and globbing are using qsort() and each passes a pointer to a function that calls strcoll() underneath, so I suspect this is a (removal of) metafication issue, because that's done in the parameter function but not in the globbing one. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Incorrect sorting of Polish characters 2016-07-16 20:07 ` Bart Schaefer @ 2016-07-18 9:33 ` Peter Stephenson 2016-07-18 10:17 ` Peter Stephenson 0 siblings, 1 reply; 7+ messages in thread From: Peter Stephenson @ 2016-07-18 9:33 UTC (permalink / raw) To: zsh-workers On Sat, 16 Jul 2016 13:07:18 -0700 Bart Schaefer <schaefer@brasslantern.com> wrote: > On Jul 16, 7:17pm, M. Bartoszkiewicz wrote: > } > } I have noticed that some Polish characters > } are sorted incorrectly in glob expansion (but > } correctly in other contexts). > > Both parameter expansion and globbing are using qsort() and each > passes a pointer to a function that calls strcoll() underneath, so I > suspect this is a (removal of) metafication issue, because that's > done in the parameter function but not in the globbing one. A simple-minded change to pass strcoll() unmetafied versions of the strings does seem to fix the problem, so it looks like this is the case. However, that's not the right fix as we only want to unmetafy once per input string, not once per comparison, and below the call to qsort() there's quite a lot of internal string handling. An equally simple-minded fix around the call to qsort() (saving and restoring the strings) didn't seem to work. So this needs a bit more thought. This should be added to the tests when it's passing. pws ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Incorrect sorting of Polish characters 2016-07-18 9:33 ` Peter Stephenson @ 2016-07-18 10:17 ` Peter Stephenson 2016-07-20 5:05 ` Bart Schaefer 0 siblings, 1 reply; 7+ messages in thread From: Peter Stephenson @ 2016-07-18 10:17 UTC (permalink / raw) To: zsh-workers On Mon, 18 Jul 2016 10:33:29 +0100 Peter Stephenson <p.stephenson@samsung.com> wrote: > On Sat, 16 Jul 2016 13:07:18 -0700 > Bart Schaefer <schaefer@brasslantern.com> wrote: > > On Jul 16, 7:17pm, M. Bartoszkiewicz wrote: > > } I have noticed that some Polish characters > > } are sorted incorrectly in glob expansion (but > > } correctly in other contexts). > > A simple-minded change to pass strcoll() unmetafied versions of the > strings does seem to fix the problem, so it looks like this is the > case. However, that's not the right fix as we only want to unmetafy > once per input string, not once per comparison, and below the call to > qsort() there's quite a lot of internal string handling. An equally > simple-minded fix around the call to qsort() (saving and restoring the > strings) didn't seem to work. So this needs a bit more thought. Adding an umetafied entry to the glob match that only gets used for sorting seems to do the trick. I think an additional single pass through the array of matches isn't a big deal. Possibly the sort code needs a check through to confirm it really is unmeta-friendly for globbing as there are different ways in. Any other suggestions? pws diff --git a/Src/glob.c b/Src/glob.c index 2051016..146b4db 100644 --- a/Src/glob.c +++ b/Src/glob.c @@ -41,7 +41,10 @@ typedef struct gmatch *Gmatch; struct gmatch { + /* Metafied file name */ char *name; + /* Unmetafied file name; embedded nulls can't occur in file names */ + char *uname; /* * Array of sort strings: one for each GS_EXEC sort type in * the glob qualifiers. @@ -911,7 +914,8 @@ gmatchcmp(Gmatch a, Gmatch b) for (i = gf_nsorts, s = gf_sortlist; i; i--, s++) { switch (s->tp & ~GS_DESC) { case GS_NAME: - r = zstrcmp(b->name, a->name, gf_numsort ? SORTIT_NUMERICALLY : 0); + r = zstrcmp(b->uname, a->uname, + gf_numsort ? SORTIT_NUMERICALLY : 0); break; case GS_DEPTH: { @@ -1859,6 +1863,7 @@ zglob(LinkList list, LinkNode np, int nountok) int nexecs = 0; struct globsort *sortp; struct globsort *lastsortp = gf_sortlist + gf_nsorts; + Gmatch gmptr; /* First find out if there are any GS_EXECs, counting them. */ for (sortp = gf_sortlist; sortp < lastsortp; sortp++) @@ -1910,6 +1915,29 @@ zglob(LinkList list, LinkNode np, int nountok) } } + /* + * Where necessary, create unmetafied version of names + * for comparison. If no Meta characters just point + * to original string. All on heap. + */ + for (gmptr = matchbuf; gmptr < matchptr; gmptr++) + { + char *nptr; + for (nptr = gmptr->name; *nptr; nptr++) + { + if (*nptr == Meta) + break; + } + if (*nptr == Meta) + { + int dummy; + gmptr->uname = dupstring(gmptr->name); + unmetafy(gmptr->uname, &dummy); + } else { + gmptr->uname = gmptr->name; + } + } + /* Sort arguments in to lexical (and possibly numeric) order. * * This is reversed to facilitate insertion into the list. */ qsort((void *) & matchbuf[0], matchct, sizeof(struct gmatch), diff --git a/Test/D07multibyte.ztst b/Test/D07multibyte.ztst index dedf241..1b1d042 100644 --- a/Test/D07multibyte.ztst +++ b/Test/D07multibyte.ztst @@ -562,3 +562,20 @@ } : $functions) 0:Multibtye handled of functions parameter + + if [[ -n ${$(locale -a 2>/dev/null)[(R)pl_PL.utf8]} ]]; then + ( + export LC_ALL=pl_PL.UTF-8 + local -a names=(a b c d e f $'\u0105' $'\u0107' $'\u0119') + print -o $names + mkdir -p plchars + cd plchars + touch $names + print ? + ) + else + ZTST_skip="No Polish UTF-8 local found, skipping sort test" + fi +0:Sorting of metafied Polish characters +>a ą b c ć d e ę f +>a ą b c ć d e ę f ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Incorrect sorting of Polish characters 2016-07-18 10:17 ` Peter Stephenson @ 2016-07-20 5:05 ` Bart Schaefer 2016-07-20 8:35 ` Peter Stephenson 0 siblings, 1 reply; 7+ messages in thread From: Bart Schaefer @ 2016-07-20 5:05 UTC (permalink / raw) To: zsh-workers On Jul 18, 11:17am, Peter Stephenson wrote: } } Adding an umetafied entry to the glob match that only gets used for } sorting seems to do the trick. I think an additional single pass } through the array of matches isn't a big deal. Hm. On a glob of 16794 files it's about 4% slower over 10 trials with the extra pass over the array of matches. I don't suppose there's some way to store the original file name at the time metafication is applied rather than metafying and then unmetafying again? ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Incorrect sorting of Polish characters 2016-07-20 5:05 ` Bart Schaefer @ 2016-07-20 8:35 ` Peter Stephenson 2016-07-22 0:38 ` Bart Schaefer 0 siblings, 1 reply; 7+ messages in thread From: Peter Stephenson @ 2016-07-20 8:35 UTC (permalink / raw) To: zsh-workers On Tue, 19 Jul 2016 22:05:08 -0700 Bart Schaefer <schaefer@brasslantern.com> wrote: > On Jul 18, 11:17am, Peter Stephenson wrote: > } > } Adding an umetafied entry to the glob match that only gets used for > } sorting seems to do the trick. I think an additional single pass > } through the array of matches isn't a big deal. > > Hm. On a glob of 16794 files it's about 4% slower over 10 trials with > the extra pass over the array of matches. I don't suppose there's some > way to store the original file name at the time metafication is applied > rather than metafying and then unmetafying again? It already works like that, and it only copies and unmetafies if there's a Meta character. It might be possible to squeeze this code somewhere else at the expense of clarity and likelihood of bugs. There are about four places where the name is set which could be turned into a function. Daniel pointed out strchr() would be neater. pws diff --git a/Src/glob.c b/Src/glob.c index 146b4db..850405f 100644 --- a/Src/glob.c +++ b/Src/glob.c @@ -1922,13 +1922,7 @@ zglob(LinkList list, LinkNode np, int nountok) */ for (gmptr = matchbuf; gmptr < matchptr; gmptr++) { - char *nptr; - for (nptr = gmptr->name; *nptr; nptr++) - { - if (*nptr == Meta) - break; - } - if (*nptr == Meta) + if (strchr(gmptr->name, Meta)) { int dummy; gmptr->uname = dupstring(gmptr->name); ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Incorrect sorting of Polish characters 2016-07-20 8:35 ` Peter Stephenson @ 2016-07-22 0:38 ` Bart Schaefer 0 siblings, 0 replies; 7+ messages in thread From: Bart Schaefer @ 2016-07-22 0:38 UTC (permalink / raw) To: zsh-workers On Jul 20, 9:35am, Peter Stephenson wrote: } Subject: Re: Incorrect sorting of Polish characters } } On Tue, 19 Jul 2016 22:05:08 -0700 } Bart Schaefer <schaefer@brasslantern.com> wrote: } > Hm. On a glob of 16794 files it's about 4% slower over 10 trials with } > the extra pass over the array of matches. I don't suppose there's some } > way to store the original file name at the time metafication is applied } > rather than metafying and then unmetafying again? } } It already works like that, and it only copies and unmetafies if there's } a Meta character. What I meant was (now that I have poked around a bit further), maybe zreaddir() should be able to return unmetatfied strings on request, which scanner() could then metafy before calling pattry() and when a match is found, pass both to insert() ? Seems wasteful for zreaddir() to unconditionally metafy when the caller may need the name unmetafied. However, this does mean passing both strings through a lot of layers, so maybe not worth it for 4%. The strchr() does shave it down a fraction. ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2016-07-22 15:51 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-07-16 17:17 Incorrect sorting of Polish characters Michał Bartoszkiewicz 2016-07-16 20:07 ` Bart Schaefer 2016-07-18 9:33 ` Peter Stephenson 2016-07-18 10:17 ` Peter Stephenson 2016-07-20 5:05 ` Bart Schaefer 2016-07-20 8:35 ` Peter Stephenson 2016-07-22 0:38 ` Bart Schaefer
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/zsh/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).