zsh-workers
 help / color / mirror / code / Atom feed
* Incorrect sorting of Polish characters
@ 2016-07-16 17:17 Michał Bartoszkiewicz
  2016-07-16 20:07 ` Bart Schaefer
  0 siblings, 1 reply; 7+ messages in thread
From: Michał Bartoszkiewicz @ 2016-07-16 17:17 UTC (permalink / raw)
  To: zsh-workers

I have noticed that some Polish characters (at least 'ć' and 'ę' –
U+0107 and U+0119) are sorted incorrectly in glob expansion (but
correctly in other contexts).

Example:
$ export LC_ALL=pl_PL.UTF-8
$ names=(a b c d e f $'\u0105' $'\u0107' $'\u0119')
$ print -o $names
a ą b c ć d e ę f
$ touch $names; print ?
a ą ć ę b c d e f

The ordering given by print -o is the correct one.

I am using zsh 5.2, but the problem also appears in git revision 3e61af3.

Thanks,
--
Michał Bartoszkiewicz <mbartoszkiewicz@gmail.com>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Incorrect sorting of Polish characters
  2016-07-16 17:17 Incorrect sorting of Polish characters Michał Bartoszkiewicz
@ 2016-07-16 20:07 ` Bart Schaefer
  2016-07-18  9:33   ` Peter Stephenson
  0 siblings, 1 reply; 7+ messages in thread
From: Bart Schaefer @ 2016-07-16 20:07 UTC (permalink / raw)
  To: zsh-workers

On Jul 16,  7:17pm, M. Bartoszkiewicz wrote:
}
} I have noticed that some Polish characters
} are sorted incorrectly in glob expansion (but
} correctly in other contexts).

Both parameter expansion and globbing are using qsort() and each
passes a pointer to a function that calls strcoll() underneath, so I
suspect this is a (removal of) metafication issue, because that's
done in the parameter function but not in the globbing one.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Incorrect sorting of Polish characters
  2016-07-16 20:07 ` Bart Schaefer
@ 2016-07-18  9:33   ` Peter Stephenson
  2016-07-18 10:17     ` Peter Stephenson
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Stephenson @ 2016-07-18  9:33 UTC (permalink / raw)
  To: zsh-workers

On Sat, 16 Jul 2016 13:07:18 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Jul 16,  7:17pm, M. Bartoszkiewicz wrote:
> }
> } I have noticed that some Polish characters
> } are sorted incorrectly in glob expansion (but
> } correctly in other contexts).
> 
> Both parameter expansion and globbing are using qsort() and each
> passes a pointer to a function that calls strcoll() underneath, so I
> suspect this is a (removal of) metafication issue, because that's
> done in the parameter function but not in the globbing one.

A simple-minded change to pass strcoll() unmetafied versions of the
strings does seem to fix the problem, so it looks like this is the
case.  However, that's not the right fix as we only want to unmetafy
once per input string, not once per comparison, and below the call to
qsort() there's quite a lot of internal string handling.  An equally
simple-minded fix around the call to qsort() (saving and restoring the
strings) didn't seem to work.  So this needs a bit more thought.

This should be added to the tests when it's passing.

pws


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Incorrect sorting of Polish characters
  2016-07-18  9:33   ` Peter Stephenson
@ 2016-07-18 10:17     ` Peter Stephenson
  2016-07-20  5:05       ` Bart Schaefer
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Stephenson @ 2016-07-18 10:17 UTC (permalink / raw)
  To: zsh-workers

On Mon, 18 Jul 2016 10:33:29 +0100
Peter Stephenson <p.stephenson@samsung.com> wrote:
> On Sat, 16 Jul 2016 13:07:18 -0700
> Bart Schaefer <schaefer@brasslantern.com> wrote:
> > On Jul 16,  7:17pm, M. Bartoszkiewicz wrote:
> > } I have noticed that some Polish characters
> > } are sorted incorrectly in glob expansion (but
> > } correctly in other contexts).
> 
> A simple-minded change to pass strcoll() unmetafied versions of the
> strings does seem to fix the problem, so it looks like this is the
> case.  However, that's not the right fix as we only want to unmetafy
> once per input string, not once per comparison, and below the call to
> qsort() there's quite a lot of internal string handling.  An equally
> simple-minded fix around the call to qsort() (saving and restoring the
> strings) didn't seem to work.  So this needs a bit more thought.

Adding an umetafied entry to the glob match that only gets used for
sorting seems to do the trick.  I think an additional single pass
through the array of matches isn't a big deal.  Possibly the sort code
needs a check through to confirm it really is unmeta-friendly for
globbing as there are different ways in.  Any other suggestions?

pws

diff --git a/Src/glob.c b/Src/glob.c
index 2051016..146b4db 100644
--- a/Src/glob.c
+++ b/Src/glob.c
@@ -41,7 +41,10 @@
 typedef struct gmatch *Gmatch;
 
 struct gmatch {
+    /* Metafied file name */
     char *name;
+    /* Unmetafied file name; embedded nulls can't occur in file names */
+    char *uname;
     /*
      * Array of sort strings:  one for each GS_EXEC sort type in
      * the glob qualifiers.
@@ -911,7 +914,8 @@ gmatchcmp(Gmatch a, Gmatch b)
     for (i = gf_nsorts, s = gf_sortlist; i; i--, s++) {
 	switch (s->tp & ~GS_DESC) {
 	case GS_NAME:
-	    r = zstrcmp(b->name, a->name, gf_numsort ? SORTIT_NUMERICALLY : 0);
+	    r = zstrcmp(b->uname, a->uname,
+			gf_numsort ? SORTIT_NUMERICALLY : 0);
 	    break;
 	case GS_DEPTH:
 	    {
@@ -1859,6 +1863,7 @@ zglob(LinkList list, LinkNode np, int nountok)
 	int nexecs = 0;
 	struct globsort *sortp;
 	struct globsort *lastsortp = gf_sortlist + gf_nsorts;
+	Gmatch gmptr;
 
 	/* First find out if there are any GS_EXECs, counting them. */
 	for (sortp = gf_sortlist; sortp < lastsortp; sortp++)
@@ -1910,6 +1915,29 @@ zglob(LinkList list, LinkNode np, int nountok)
 	    }
 	}
 
+	/*
+	 * Where necessary, create unmetafied version of names
+	 * for comparison.  If no Meta characters just point
+	 * to original string.  All on heap.
+	 */
+	for (gmptr = matchbuf; gmptr < matchptr; gmptr++)
+	{
+	    char *nptr;
+	    for (nptr = gmptr->name; *nptr; nptr++)
+	    {
+		if (*nptr == Meta)
+		    break;
+	    }
+	    if (*nptr == Meta)
+	    {
+		int dummy;
+		gmptr->uname = dupstring(gmptr->name);
+		unmetafy(gmptr->uname, &dummy);
+	    } else {
+		gmptr->uname = gmptr->name;
+	    }
+	}
+
 	/* Sort arguments in to lexical (and possibly numeric) order. *
 	 * This is reversed to facilitate insertion into the list.    */
 	qsort((void *) & matchbuf[0], matchct, sizeof(struct gmatch),
diff --git a/Test/D07multibyte.ztst b/Test/D07multibyte.ztst
index dedf241..1b1d042 100644
--- a/Test/D07multibyte.ztst
+++ b/Test/D07multibyte.ztst
@@ -562,3 +562,20 @@
   }
   : $functions)
 0:Multibtye handled of functions parameter
+
+  if [[ -n ${$(locale -a 2>/dev/null)[(R)pl_PL.utf8]} ]]; then
+  (
+    export LC_ALL=pl_PL.UTF-8
+    local -a names=(a b c d e f $'\u0105' $'\u0107' $'\u0119')
+    print -o $names
+    mkdir -p plchars
+    cd plchars
+    touch $names
+    print ?
+  )
+  else
+    ZTST_skip="No Polish UTF-8 local found, skipping sort test"
+  fi
+0:Sorting of metafied Polish characters
+>a ą b c ć d e ę f
+>a ą b c ć d e ę f


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Incorrect sorting of Polish characters
  2016-07-18 10:17     ` Peter Stephenson
@ 2016-07-20  5:05       ` Bart Schaefer
  2016-07-20  8:35         ` Peter Stephenson
  0 siblings, 1 reply; 7+ messages in thread
From: Bart Schaefer @ 2016-07-20  5:05 UTC (permalink / raw)
  To: zsh-workers

On Jul 18, 11:17am, Peter Stephenson wrote:
}
} Adding an umetafied entry to the glob match that only gets used for
} sorting seems to do the trick.  I think an additional single pass
} through the array of matches isn't a big deal.

Hm.  On a glob of 16794 files it's about 4% slower over 10 trials with
the extra pass over the array of matches.  I don't suppose there's some
way to store the original file name at the time metafication is applied
rather than metafying and then unmetafying again?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Incorrect sorting of Polish characters
  2016-07-20  5:05       ` Bart Schaefer
@ 2016-07-20  8:35         ` Peter Stephenson
  2016-07-22  0:38           ` Bart Schaefer
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Stephenson @ 2016-07-20  8:35 UTC (permalink / raw)
  To: zsh-workers

On Tue, 19 Jul 2016 22:05:08 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Jul 18, 11:17am, Peter Stephenson wrote:
> }
> } Adding an umetafied entry to the glob match that only gets used for
> } sorting seems to do the trick.  I think an additional single pass
> } through the array of matches isn't a big deal.
> 
> Hm.  On a glob of 16794 files it's about 4% slower over 10 trials with
> the extra pass over the array of matches.  I don't suppose there's some
> way to store the original file name at the time metafication is applied
> rather than metafying and then unmetafying again?

It already works like that, and it only copies and unmetafies if there's
a Meta character.

It might be possible to squeeze this code somewhere else at the expense
of clarity and likelihood of bugs.  There are about four places where
the name is set which could be turned into a function.

Daniel pointed out strchr() would be neater.

pws

diff --git a/Src/glob.c b/Src/glob.c
index 146b4db..850405f 100644
--- a/Src/glob.c
+++ b/Src/glob.c
@@ -1922,13 +1922,7 @@ zglob(LinkList list, LinkNode np, int nountok)
 	 */
 	for (gmptr = matchbuf; gmptr < matchptr; gmptr++)
 	{
-	    char *nptr;
-	    for (nptr = gmptr->name; *nptr; nptr++)
-	    {
-		if (*nptr == Meta)
-		    break;
-	    }
-	    if (*nptr == Meta)
+	    if (strchr(gmptr->name, Meta))
 	    {
 		int dummy;
 		gmptr->uname = dupstring(gmptr->name);


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Incorrect sorting of Polish characters
  2016-07-20  8:35         ` Peter Stephenson
@ 2016-07-22  0:38           ` Bart Schaefer
  0 siblings, 0 replies; 7+ messages in thread
From: Bart Schaefer @ 2016-07-22  0:38 UTC (permalink / raw)
  To: zsh-workers

On Jul 20,  9:35am, Peter Stephenson wrote:
} Subject: Re: Incorrect sorting of Polish characters
}
} On Tue, 19 Jul 2016 22:05:08 -0700
} Bart Schaefer <schaefer@brasslantern.com> wrote:
} > Hm.  On a glob of 16794 files it's about 4% slower over 10 trials with
} > the extra pass over the array of matches.  I don't suppose there's some
} > way to store the original file name at the time metafication is applied
} > rather than metafying and then unmetafying again?
} 
} It already works like that, and it only copies and unmetafies if there's
} a Meta character.

What I meant was (now that I have poked around a bit further), maybe
zreaddir() should be able to return unmetatfied strings on request,
which scanner() could then metafy before calling pattry() and when a
match is found, pass both to insert() ?

Seems wasteful for zreaddir() to unconditionally metafy when the
caller may need the name unmetafied.  However, this does mean passing
both strings through a lot of layers, so maybe not worth it for 4%.

The strchr() does shave it down a fraction.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-07-22 15:51 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-16 17:17 Incorrect sorting of Polish characters Michał Bartoszkiewicz
2016-07-16 20:07 ` Bart Schaefer
2016-07-18  9:33   ` Peter Stephenson
2016-07-18 10:17     ` Peter Stephenson
2016-07-20  5:05       ` Bart Schaefer
2016-07-20  8:35         ` Peter Stephenson
2016-07-22  0:38           ` Bart Schaefer

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).