zsh-workers
 help / color / mirror / code / Atom feed
* zsh/complist colours improperly handle multibyte characters
@ 2016-10-21  1:09 ` Danielle McLean
  2016-10-21  4:07   ` Bart Schaefer
  2016-10-21  8:33   ` Peter Stephenson
  0 siblings, 2 replies; 11+ messages in thread
From: Danielle McLean @ 2016-10-21  1:09 UTC (permalink / raw)
  To: zsh-workers; +Cc: danielsh

Although zsh patterns usually have full support for multibyte characters, a zsh pattern containing certain multibyte characters will fail to match if provided to zsh/complist using $ZLS_COLORS or the list-colors zstyle. For example, the following zstyles will fail to match and colour anything:

    zstyle :completion:*:options list-separator │
    zstyle :completion:*:options =*│*=32
    ls -l<TAB>		

This is not the case for *all* multibyte characters. For example, the following zstyles successfully colour the entire option green:

    zstyle :completion:*:options list-separator ©
    zstyle :completion:*:options =*©*=32
    ls -l<TAB>

Additionally, when applying multiple colours with a (#b) pattern, multibyte characters break the colouring even if matched with a wildcard like * rather than a literal. In this example, the options are blue and their descriptions green - except for the last two characters of the description which are not coloured:

    zstyle :completion:*:options list-separator │
    zstyle :completion:*:options "=(#b)([^ ]#)(*)=0=34=32"
    ls -l<TAB>

This issue also occurs if © is used as the separator: the complist will colour all but the last character of the description, in that case.

I believe the reason behind this behaviour is that in UTF-8, © is a two-byte character and │ is a three-byte character. Presumably, zsh/complist's pattern parsing is only aware of one- and two-byte chars, and its (#b)-based grouping mistakenly assumes that all characters are single bytes.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: zsh/complist colours improperly handle multibyte characters
  2016-10-21  1:09 ` zsh/complist colours improperly handle multibyte characters Danielle McLean
@ 2016-10-21  4:07   ` Bart Schaefer
  2016-10-23 17:30     ` Peter Stephenson
       [not found]     ` <20161023184641.4549e10a@ntlworld.com>
  2016-10-21  8:33   ` Peter Stephenson
  1 sibling, 2 replies; 11+ messages in thread
From: Bart Schaefer @ 2016-10-21  4:07 UTC (permalink / raw)
  To: zsh-workers; +Cc: danielsh

On Oct 21, 12:09pm, Danielle McLean wrote:
}
} Presumably, zsh/complist's pattern parsing is only aware of one- and
} two-byte chars, and its (#b)-based grouping mistakenly assumes that
} all characters are single bytes.

Reaching this conclusion might be understandable, but it is pretty far
off the mark.

The problem with the (#b) example is not that the pattern has matched
improperly, but that #ifdef MULTIBYTE_SUPPORT has never been worked
into complist.c:clprintfmt().  So it is counting the bytes rather than
the display width when deciding where to start/stop coloring.

I'm not going to dive into that, I don't know the multibyte handling
code well enough.

The problem with the vertical-bar example is that the pattern did not
compile correctly at all.  The same effect occurs if you use "#" in
that position -- it results in an invalid pattern which is silently
ignored.

Stepping through with gdb for the vertical-bar pattern, I get:

 pattern.c:1530: BUG: - missing from numeric glob

I thought this was a metafication problem, and indeed if I metafy()
the string passed to patcompile() then the compilation succeeds, but
then the resulting program fails to match for even simple patterns
so nothing is ever colored.

This is about as far as I'm going to be able to get with this.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: zsh/complist colours improperly handle multibyte characters
  2016-10-21  1:09 ` zsh/complist colours improperly handle multibyte characters Danielle McLean
  2016-10-21  4:07   ` Bart Schaefer
@ 2016-10-21  8:33   ` Peter Stephenson
  2016-10-21 15:48     ` Bart Schaefer
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Stephenson @ 2016-10-21  8:33 UTC (permalink / raw)
  To: Danielle McLean, zsh-workers; +Cc: danielsh

On Fri, 21 Oct 2016 12:09:34 +1100
Danielle McLean <gopsychonauts@gmail.com> wrote:
> Although zsh patterns usually have full support for multibyte
> characters, a zsh pattern containing certain multibyte characters will
> fail to match if provided to zsh/complist using $ZLS_COLORS or the
> list-colors zstyle.

Yes, unfortunately character-by-character analysis is built into
completion at quite a low level.

I don't know a great deal about the complist code --- except it's
appeared utterly opaque when I have looked at it --- but I did
try to get grips with the matching control, where there are similar
problems, and utterly failed.  The assumption that a byte in the
string on the command line and a byte in the test string can be
considered equivalent is all-pervasive, but the sort of tricks in use
didn't make converting to wide characters a particularly attractive
optin, either. It became clear this was a huge job I didn't have a clue
about.

complist may not be quite that bad, and indeed character *matching*
isn't a big part of it so this may not be a particularly difficult
problem when it comes down to it, but unless anyone makes it their
life's work to get to grips with it it's probably stuck --- it's clear
Bart and I aren't ever going to have that sort of time.

pws


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: zsh/complist colours improperly handle multibyte characters
  2016-10-21  8:33   ` Peter Stephenson
@ 2016-10-21 15:48     ` Bart Schaefer
  0 siblings, 0 replies; 11+ messages in thread
From: Bart Schaefer @ 2016-10-21 15:48 UTC (permalink / raw)
  To: zsh-workers; +Cc: danielsh

On Oct 21,  9:33am, Peter Stephenson wrote:
}
} Yes, unfortunately character-by-character analysis is built into
} completion at quite a low level.

I think this particular issue may be a lot more tractible than you
believe.  clnicezputs() already has the MULTIBYTE_SUPPORT branch
that seems also to be needed by clprintfmt() -- I could probably
make an attempt to translate, but someone who has actually worked
with e.g. ZMB_nicewidth() and re-composing of bytes into multibyte
characters could probably do it faster.

The other part seems a plain pattern compilation issue.  Neither of
these gets involved with comparing strings on the line to strings
in the candidates.  This is all on the tail end where we already
have the matches and we're displaying them.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: zsh/complist colours improperly handle multibyte characters
  2016-10-21  4:07   ` Bart Schaefer
@ 2016-10-23 17:30     ` Peter Stephenson
  2016-10-23 18:23       ` Bart Schaefer
       [not found]     ` <20161023184641.4549e10a@ntlworld.com>
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Stephenson @ 2016-10-23 17:30 UTC (permalink / raw)
  To: zsh-workers

On Thu, 20 Oct 2016 21:07:35 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> The problem with the (#b) example is not that the pattern has matched
> improperly, but that #ifdef MULTIBYTE_SUPPORT has never been worked
> into complist.c:clprintfmt().  So it is counting the bytes rather than
> the display width when deciding where to start/stop coloring.

So something like this might be better?

This is counting characters properly, but not taking into account the
width of the character may be greater than 1.  As the outer loop is
assuming one position per character I don't know how to fix that
properly.  Probably incrementing cc by the width and doing something
clever with the "beg =" is a further imporvement, but I have no idea
what to about i.

pws

diff --git a/Src/Zle/complist.c b/Src/Zle/complist.c
index 8aeb6c3..39ac782 100644
--- a/Src/Zle/complist.c
+++ b/Src/Zle/complist.c
@@ -662,7 +662,9 @@ clprintfmt(char *p, int ml)
 
     initiscol();
 
-    for (; *p; p++) {
+    while (*p) {
+	convchar_t chr;
+	int chrlen = MB_METACHARLENCONV(p, &chr);
 	doiscol(i++);
 	cc++;
 	if (*p == '\n') {
@@ -673,11 +675,16 @@ clprintfmt(char *p, int ml)
 	if (ml == mlend - 1 && (cc % zterm_columns) == zterm_columns - 1)
 	    return 0;
 
-	if (*p == Meta) {
+	while (chrlen) {
+	    if (*p == Meta) {
+		p++;
+		chrlen--;
+		putc(*p ^ 32, shout);
+	    } else
+		putc(*p, shout);
+	    chrlen--;
 	    p++;
-	    putc(*p ^ 32, shout);
-	} else
-	    putc(*p, shout);
+	}
 	if ((beg = !(cc % zterm_columns)))
 	    ml++;
 	if (mscroll && !(cc % zterm_columns) &&


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: zsh/complist colours improperly handle multibyte characters
  2016-10-23 17:30     ` Peter Stephenson
@ 2016-10-23 18:23       ` Bart Schaefer
  0 siblings, 0 replies; 11+ messages in thread
From: Bart Schaefer @ 2016-10-23 18:23 UTC (permalink / raw)
  To: zsh-workers

On Oct 23,  6:30pm, Peter Stephenson wrote:
}
} So something like this might be better?

Yes, this appears to solve the issue, at least for multibyte characters
that take up only one screen column (as you noted).

} This is counting characters properly, but not taking into account the
} width of the character may be greater than 1.  As the outer loop is
} assuming one position per character I don't know how to fix that

Looks like doiscol() would need to have the character width passed down
to it to correctly manipulate the color position arrays, and then both
i += chrwid and cc += chrwid or something like that.

As long as sendpos[] and curiscols[] don't change color in the middle
of a multi-column character I think the output part should be OK ...
but I'm not certain, either.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: zsh/complist colours improperly handle multibyte characters
       [not found]       ` <161023105652.ZM3309@torch.brasslantern.com>
@ 2016-10-23 18:59         ` Peter Stephenson
  2016-10-23 19:34           ` Bart Schaefer
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Stephenson @ 2016-10-23 18:59 UTC (permalink / raw)
  To: Zsh hackers list

On Sun, 23 Oct 2016 10:56:52 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> [This is off-list, did you mean to?]

No, that was just stuipdity.

> On Oct 23,  6:46pm, Peter Stephenson wrote:
> } Subject: Re: zsh/complist colours improperly handle multibyte characters
> }
> } On Thu, 20 Oct 2016 21:07:35 -0700
> } Bart Schaefer <schaefer@brasslantern.com> wrote:
> } > Stepping through with gdb for the vertical-bar pattern, I get:
> } > 
> } >  pattern.c:1530: BUG: - missing from numeric glob
> } 
> } Sounds like they're being treated as pattern characters when that's not
> } what you want?
> 
> No, sorry, this is a UTF-8 full-line-height vertical-bar, not ascii pipe.
> It's incorrectly interpreted as a left angle bracket pattern character,
> if that BUG message is accurate.

Ah, then there's a good chance this is indeed a problem with
zshtokenize.  We probably ought at least to pass through metafied
characters.  I don't know that fits this particular case, but it's the
obvious problem.

pws

diff --git a/Src/glob.c b/Src/glob.c
index a845c5f..0442bbf 100644
--- a/Src/glob.c
+++ b/Src/glob.c
@@ -3499,6 +3499,10 @@ zshtokenize(char *s, int flags)
     for (; *s; s++) {
       cont:
 	switch (*s) {
+	case Meta:
+	    /* skip Meta as well as following character */
+	    s++;
+	    break;
 	case Bnull:
 	case Bnullkeep:
 	case '\\':


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: zsh/complist colours improperly handle multibyte characters
  2016-10-23 18:59         ` Peter Stephenson
@ 2016-10-23 19:34           ` Bart Schaefer
  2016-10-25 10:44             ` Peter Stephenson
  0 siblings, 1 reply; 11+ messages in thread
From: Bart Schaefer @ 2016-10-23 19:34 UTC (permalink / raw)
  To: Zsh hackers list

On Oct 23,  7:59pm, Peter Stephenson wrote:
} Subject: Re: zsh/complist colours improperly handle multibyte characters
}
} On Sun, 23 Oct 2016 10:56:52 -0700
} Bart Schaefer <schaefer@brasslantern.com> wrote:
} > No, sorry, this is a UTF-8 full-line-height vertical-bar, not ascii pipe.
} > It's incorrectly interpreted as a left angle bracket pattern character,
} > if that BUG message is accurate.
} 
} Ah, then there's a good chance this is indeed a problem with
} zshtokenize.  We probably ought at least to pass through metafied
} characters.  I don't know that fits this particular case, but it's the
} obvious problem.

Nope, the zshtokenize patch doesn't help in this case at all.  I still
get the BUG: message.  Strangely (?) I do NOT get that message if I use
the character directly in a pattern expression such as [[ ... ]], so
it has something to do with the way compdescribe is passing it around.

The three bytes in the UTF-8 character are e2 94 82 if that helps.  See
the examples in 39695 except that you need to add "list-colors" as the
context in the second zstyle in each case.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: zsh/complist colours improperly handle multibyte characters
  2016-10-23 19:34           ` Bart Schaefer
@ 2016-10-25 10:44             ` Peter Stephenson
  2016-10-25 15:59               ` Bart Schaefer
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Stephenson @ 2016-10-25 10:44 UTC (permalink / raw)
  To: Zsh hackers list

On Sun, 23 Oct 2016 12:34:16 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Oct 23,  7:59pm, Peter Stephenson wrote:
> } Subject: Re: zsh/complist colours improperly handle multibyte characters
> }
> } On Sun, 23 Oct 2016 10:56:52 -0700
> } Bart Schaefer <schaefer@brasslantern.com> wrote:
> } > No, sorry, this is a UTF-8 full-line-height vertical-bar, not ascii pipe.
> } > It's incorrectly interpreted as a left angle bracket pattern character,
> } > if that BUG message is accurate.
> } 
> } Ah, then there's a good chance this is indeed a problem with
> } zshtokenize.  We probably ought at least to pass through metafied
> } characters.  I don't know that fits this particular case, but it's the
> } obvious problem.
> 
> Nope, the zshtokenize patch doesn't help in this case at all.  I still
> get the BUG: message.  Strangely (?) I do NOT get that message if I use
> the character directly in a pattern expression such as [[ ... ]], so
> it has something to do with the way compdescribe is passing it around.

For me this was working just with metafying the string in complist when
it goes to patcompile().  The patch I posted makes this a bit safer in
theory, though in fact I don't think we hit the problem in practice.

In the previous code, the input string is

* E2 94 82 *

That 94 looks like a token.  On tokenisation we get

87 E2 94 82 87

The * has become a token, but 94 still looks like a token because it's
not protected.  So the pattern compiler turns it back into the
corresponding string form, '<', when it gets an incomplete multibyte
pattern.  This makes the pattern look invalid, so it gives up.  Later you
get "<" as a token, which doesn't work as there's no numeric expression.

To fix this safely, we need first to metafy the input string,

* E2 83 B4 82 *

then tokenise it with the change I previously posted to skip Meta,

87 E2 83 B4 82 87

What the extra change is doing is making sure that 83 B4 goes through as
is --- a metafied character is by definition escaped from tokenisation.
However, because this only happens when bit 7 is set, and we'll never
tokenise such a character, I don't think it actually makes a
difference.  But I've left it in as it respects the intention.

pws

diff --git a/Src/Zle/complist.c b/Src/Zle/complist.c
index 39ac782..d4672a1 100644
--- a/Src/Zle/complist.c
+++ b/Src/Zle/complist.c
@@ -415,6 +415,7 @@ getcoldef(char *s)
 		break;
 	    *s++ = '\0';
 	}
+	p = metafy(p, strlen(p), META_USEHEAP);
 	tokenize(p);
 	if ((prog = patcompile(p, 0, NULL))) {
 	    Patcol pc, po;
diff --git a/Src/glob.c b/Src/glob.c
index a845c5f..50f6dce 100644
--- a/Src/glob.c
+++ b/Src/glob.c
@@ -3499,6 +3499,10 @@ zshtokenize(char *s, int flags)
     for (; *s; s++) {
       cont:
 	switch (*s) {
+	case Meta:
+	    /* skip both Meta and following character */
+	    s++;
+	    break;
 	case Bnull:
 	case Bnullkeep:
 	case '\\':


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: zsh/complist colours improperly handle multibyte characters
  2016-10-25 10:44             ` Peter Stephenson
@ 2016-10-25 15:59               ` Bart Schaefer
  2016-10-25 16:05                 ` Peter Stephenson
  0 siblings, 1 reply; 11+ messages in thread
From: Bart Schaefer @ 2016-10-25 15:59 UTC (permalink / raw)
  To: Zsh hackers list

On Oct 25, 11:44am, Peter Stephenson wrote:
}
} To fix this safely, we need first to metafy the input string,
} 
} * E2 83 B4 82 *
} 
} then tokenise it with the change I previously posted to skip Meta,

This seems to have done it.  I tried the metafy before tokenize route,
but without the change to zshtokenize that wasn't sufficient.

Do we also need this here?  And isn't it slightly better to pass -1
to metafy() rather than call strlen()?

diff --git a/Src/Zle/complist.c b/Src/Zle/complist.c
index d4672a1..2edaf6e 100644
--- a/Src/Zle/complist.c
+++ b/Src/Zle/complist.c
@@ -347,9 +347,10 @@ getcoldef(char *s)
 	    char sav = p[1];
 
 	    p[1] = '\0';
+	    s = metafy(s, -1, META_USEHEAP);
 	    tokenize(s);
 	    gprog = patcompile(s, 0, NULL);
-	    p[1]  =sav;
+	    p[1] = sav;
 
 	    s = p + 1;
 	}
@@ -415,7 +416,7 @@ getcoldef(char *s)
 		break;
 	    *s++ = '\0';
 	}
-	p = metafy(p, strlen(p), META_USEHEAP);
+	p = metafy(p, -1, META_USEHEAP);
 	tokenize(p);
 	if ((prog = patcompile(p, 0, NULL))) {
 	    Patcol pc, po;


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: zsh/complist colours improperly handle multibyte characters
  2016-10-25 15:59               ` Bart Schaefer
@ 2016-10-25 16:05                 ` Peter Stephenson
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Stephenson @ 2016-10-25 16:05 UTC (permalink / raw)
  To: Zsh hackers list

On Tue, 25 Oct 2016 08:59:20 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Oct 25, 11:44am, Peter Stephenson wrote:
> }
> } To fix this safely, we need first to metafy the input string,
> } 
> } * E2 83 B4 82 *
> } 
> } then tokenise it with the change I previously posted to skip Meta,
> 
> This seems to have done it.  I tried the metafy before tokenize route,
> but without the change to zshtokenize that wasn't sufficient.
> 
> Do we also need this here?  And isn't it slightly better to pass -1
> to metafy() rather than call strlen()?

It looks highly likely.

pws


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-10-25 18:59 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20161021011017eucas1p1d1fceae920242e69b4426bc41cb3c064@eucas1p1.samsung.com>
2016-10-21  1:09 ` zsh/complist colours improperly handle multibyte characters Danielle McLean
2016-10-21  4:07   ` Bart Schaefer
2016-10-23 17:30     ` Peter Stephenson
2016-10-23 18:23       ` Bart Schaefer
     [not found]     ` <20161023184641.4549e10a@ntlworld.com>
     [not found]       ` <161023105652.ZM3309@torch.brasslantern.com>
2016-10-23 18:59         ` Peter Stephenson
2016-10-23 19:34           ` Bart Schaefer
2016-10-25 10:44             ` Peter Stephenson
2016-10-25 15:59               ` Bart Schaefer
2016-10-25 16:05                 ` Peter Stephenson
2016-10-21  8:33   ` Peter Stephenson
2016-10-21 15:48     ` Bart Schaefer

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).