problem in prompt in utf-8

zsh-workers

* problem in prompt in utf-8
@ 2005-09-11 12:13 Zvi Har'El
  2005-09-11 16:55 ` Zvi Har'El
  2005-09-17 18:15 ` Peter Stephenson
  0 siblings, 2 replies; 6+ messages in thread
From: Zvi Har'El @ 2005-09-11 12:13 UTC (permalink / raw)
  To: Zsh hackers list; +Cc: Nadav Har'El

Hello,

I have started using zsh-4.3.0 from the CVS, in a utf-8 locale. I enjoy it
very much. However, I have a problem with the prompting. This is not new, but
since the completion now works nicely, I thought I would mention it, since it
is not solved yet.

I have the setting

    PS1=%/$\ 

I expect that 

    print -P $PS1 

and

    pwd 

will give the same output, which will also be the zsh prompt (except the final
$ and space, of course). However, this fails if the current directory name
contains Hebrew letters. These are in the range U+05D0 to U+05EA, i.e., the
utf-8 sequences have two bytes, the first always 0xD7 (M-W) and the second in
the range 0x90 (M-^P) to 0xAA (M-*). I mkdir'ed a directory which has all the
letters in this range:

/home/rl$ mkdir אבגדהוזחטיךכלםמןנסעףפץצקרשת

cd'ed to that directory:

/home/rl$ cd אבגדהוזחטיךכלםמןנסעףפץצקרשת 

The echo I got was correct:

~/אבגדהוזחטיךכלםמןנסעףפץצקרשת

The next prompt had invalid utf-8 sequences:

/home/rl/������������לםמןנסעףפץצקרשת$ 

To be more specific, all of the range U+05D0 to U+05DB (second byte 0x90 to
0x9B) became invalid. I don't know exactly what is wrong. Notice that 'pwd'
produces

/home/rl/אבגדהוזחטיךכלםמןנסעףפץצקרשת

I.e., all the letters are correct, while 'print -P $PS1' produces

/home/rl/אבגדהוזחטיך�לםמןנסעףפץצקרשת$

with exactly one invalid utf-8 sequence: more specifically, U+05DB (second
byte 0x9B), the last one in the previous range, is bad.
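
The byte values quoted above follow directly from the UTF-8 encoding rules. As a minimal sketch (the function name utf8_encode_bmp is made up for this illustration, it is not part of zsh, and surrogates are not rejected), the following C encodes a code point below U+10000 and can be used to confirm that U+05D0, U+05DB and U+05EA come out as 0xD7 0x90, 0xD7 0x9B and 0xD7 0xAA:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode a code point below U+10000 as UTF-8.
 * Writes up to 3 bytes to out and returns the number written,
 * or 0 if cp is out of range for this sketch.
 * (utf8_encode_bmp is an illustrative name, not a zsh function.)
 */
static size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    if (cp < 0x80) {                    /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

Every Hebrew letter in the range shares the lead byte 0xD7, so only the second byte distinguishes them.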

print -P $PS1 | cat -v produces

/home/rl/M-WM-^PM-WM-^QM-WM-^RM-WM-^SM-WM-^TM-WM-^UM-WM-^VM-WM-^WM-WM-^XM-WM-^YM-WM-^ZM-WM-WM-^\M-WM-^]M-WM-^^M-WM-^_M-WM- M-WM-!M-WM-"M-WM-#M-WM-$M-WM-%M-WM-&M-WM-'M-WM-(M-WM-)M-WM-*$

while pwd | cat -v produces

/home/rl/M-WM-^PM-WM-^QM-WM-^RM-WM-^SM-WM-^TM-WM-^UM-WM-^VM-WM-^WM-WM-^XM-WM-^YM-WM-^ZM-WM-^[M-WM-^\M-WM-^]M-WM-^^M-WM-^_M-WM- M-WM-!M-WM-"M-WM-#M-WM-$M-WM-%M-WM-&M-WM-'M-WM-(M-WM-)M-WM-*

It is perhaps hard to see the difference, but a close inspection shows that
the first string contains a solitary M-W between the M-WM-^Z and the
M-WM-^\ sequences, while the second one contains the sequence M-WM-^[ there,
i.e., an M-^[, or Meta-Escape, was dropped from the string.
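
For readers unfamiliar with the M- and ^ notation in the `cat -v` dumps, here is a rough C sketch of how a single byte maps to that notation. It is an approximation written for this note (cat_v_byte is a hypothetical name, and tab and newline, which `cat -v` passes through unchanged, are not special-cased):

```c
/*
 * Format one byte roughly the way `cat -v` displays it.
 * buf must have room for 5 bytes ("M-^?" plus the terminator).
 * (cat_v_byte is an illustrative name; tab and newline are not
 * special-cased here, although cat -v prints them unchanged.)
 */
static void cat_v_byte(unsigned char c, char buf[5])
{
    char *p = buf;

    if (c >= 0x80) {            /* high bit set: prefix "M-" and strip it */
        *p++ = 'M';
        *p++ = '-';
        c &= 0x7F;
    }
    if (c < 0x20) {             /* control character: ^@ .. ^_ */
        *p++ = '^';
        *p++ = (char)(c + 0x40);
    } else if (c == 0x7F) {     /* DEL */
        *p++ = '^';
        *p++ = '?';
    } else {                    /* printable ASCII stands for itself */
        *p++ = (char)c;
    }
    *p = '\0';
}
```

With this mapping 0xD7 is shown as M-W and 0x9B as M-^[, which is why the missing byte appears as a missing M-^[ in the dump.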

Unfortunately, I didn't find an easy way to put the real prompt in a file, so
I can't tell what the exact sequences in it are.

I hope this makes some sense.

-- 
Dr. Zvi Har'El      mailto:rl@math.technion.ac.il    Department of Mathematics
tel:+972-54-4227607 icq:179294841    Technion - Israel Institute of Technology
fax:+972-4-8293388  http://www.math.technion.ac.il/~rl/    Haifa 32000, ISRAEL
"If you can't say somethin' nice, don't say nothin' at all." -- Thumper (1942)
                               Sunday, 7 Elul 5765, 11 September 2005,  1:54PM


* Re: problem in prompt in utf-8
  2005-09-11 12:13 problem in prompt in utf-8 Zvi Har'El
@ 2005-09-11 16:55 ` Zvi Har'El
  2005-09-11 17:05   ` Zvi Har'El
  2005-09-17 18:15 ` Peter Stephenson
  1 sibling, 1 reply; 6+ messages in thread
From: Zvi Har'El @ 2005-09-11 16:55 UTC (permalink / raw)
  To: Zsh hackers list

I found out that the problem with the prompt (rather than with print -P $PS1)
is TERM dependent. When I set TERM=dumb, the prompt is printed correctly on the
screen, but afterwards the cursor moves to the right several positions, which
seems to be identical in length to the size of the prompt. Since each
character is 2 bytes, I suppose the length is incorrectly calculated.
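
To make the suspected mismatch concrete, here is a small C sketch comparing the byte length of a UTF-8 string with its character count. It assumes valid UTF-8 input, and utf8_char_count is a name invented for this illustration, not something from the zsh sources:

```c
#include <stddef.h>

/*
 * Count the characters in a valid UTF-8 string by counting every byte
 * that is not a continuation byte (10xxxxxx).
 * (utf8_char_count is an illustrative name, not a zsh function.)
 */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For a prompt such as "/home/rl/" followed by two Hebrew letters and "$ ", strlen() reports 15 bytes while this function reports 13 characters, so any cursor arithmetic based on the byte length overshoots by one column per two-byte letter.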


* Re: problem in prompt in utf-8
  2005-09-11 16:55 ` Zvi Har'El
@ 2005-09-11 17:05   ` Zvi Har'El
  0 siblings, 0 replies; 6+ messages in thread
From: Zvi Har'El @ 2005-09-11 17:05 UTC (permalink / raw)
  To: Zsh hackers list; +Cc: Nadav Har'El

Final piece of information: using "screen" and taking a screenlog, I have been
able to check that the illegal sequences in the prompt all result from a
solitary 0xD7 (M-W) byte (i.e., the second byte of the sequence disappears).



* Re: problem in prompt in utf-8
  2005-09-11 12:13 problem in prompt in utf-8 Zvi Har'El
  2005-09-11 16:55 ` Zvi Har'El
@ 2005-09-17 18:15 ` Peter Stephenson
  2005-09-17 21:33   ` Peter Stephenson
  1 sibling, 1 reply; 6+ messages in thread
From: Peter Stephenson @ 2005-09-17 18:15 UTC (permalink / raw)
  To: Zsh hackers list

"Zvi Har'El" wrote:
> I have started using zsh-4.3.0 from the CVS, in a utf-8 locale. I enjoy it
> very much. However, I have a problem with the prompting. This is not new, but
> since the completion now works nicely, I thought I would mention it, since it
> is not solved yet.

> /home/rl$ cd אבגדהוזחטיךכלםמןנסעףפץצקרשת 
> 
> The next prompt had invalid utf-8 sequences:
> 
> 
> /home/rl/������������לםמןנסעףפץצקרשת$ 

[This message uses raw 8-bit UTF-8, as the original did; hope this
came through OK, since I hacked the headers by hand.  MH in Emacs is a
bit antiquated.  I'm only surprised my system managed to display Hebrew
characters OK...  It doesn't actually matter apart from the quoted text
above.]

There was an inconsistency when formatting a string that contained a
character in the range reserved for tokens: conversion to the zsh
internal form (metafication) wasn't done correctly.  This particular
problem wasn't actually within zle, it was in the main shell and (as you
sort of indicated) wasn't directly related to multibyte characters.
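
As a rough illustration of what metafication means here, the sketch below escapes bytes that fall in a reserved range by emitting a Meta byte followed by the original byte XORed with 32, the same transformation the patch below applies in nicechar(). The numeric values of META, TOKEN_FIRST and TOKEN_LAST are assumptions made for this example (the upper bound was picked to match the last byte reported broken in this thread), needs_meta() is a simplified stand-in for the shell's table-driven test, and the real code also escapes a few other bytes such as NUL:

```c
#include <stddef.h>

/* Assumed values for illustration; the real definitions are in the zsh sources. */
#define META        0x83u   /* escape byte introducing a metafied character */
#define TOKEN_FIRST 0x84u   /* assumed start of the token range */
#define TOKEN_LAST  0x9Bu   /* assumed end; matches the last byte reported broken */

/* Simplified stand-in for the shell's table-driven check. */
static int needs_meta(unsigned char c)
{
    return c == META || (c >= TOKEN_FIRST && c <= TOKEN_LAST);
}

/*
 * Metafy n bytes from in into out; out needs room for 2*n + 1 bytes.
 * Returns the metafied length, not counting the terminator.
 * (metafy_sketch is an illustrative name, not the zsh function.)
 */
static size_t metafy_sketch(const unsigned char *in, size_t n,
                            unsigned char *out)
{
    size_t i, j = 0;

    for (i = 0; i < n; i++) {
        if (needs_meta(in[i])) {
            out[j++] = (unsigned char)META;
            out[j++] = (unsigned char)(in[i] ^ 32);
        } else {
            out[j++] = in[i];
        }
    }
    out[j] = '\0';
    return j;
}
```

Under these assumptions the two bytes 0xD7 0x9B become the three bytes 0xD7 0x83 0xBB internally, while 0xD7 0x9C is left alone. If one code path skips this step and another one later undoes it, the second byte of the character is lost or altered.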

This should fix the immediate problem, but note that the width of the
prompt isn't calculated correctly yet: we don't scan prompts for
multibyte characters.  Hence you might see oddities with the display
since the shell doesn't know the position of the cursor after the
prompt.  This is another thing on the list of fixes needed in zle.  (It
should come under the "not rocket science" heading, unlike the
completion code, so I hope it will be fixed relatively soon.)

Please do report any more of these inconsistencies; users who regularly
encounter character sets other than latin-based ones are valuable for
this.

I hope I haven't caused any new problems... I think I caught all the
uses of nicechar() and made sure they expected metafied strings.  The
first hunk is tangential to the rest: on the way in, I noticed that the
variable pwd was metafied and so needed to be unmetafied on output.
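
The reason a metafied string must not go straight to fputs() can be sketched as follows: the escape byte has to be dropped and the following byte restored on the way out. This is only an illustration of that idea, not the zputs() implementation; put_unmetafied is a made-up name and the value of META is an assumption:

```c
#include <stdio.h>

#define META 0x83u  /* assumed value of the Meta escape byte */

/*
 * Write a metafied, NUL-terminated string to f in its external form:
 * a META byte is dropped and the byte after it is XORed with 32.
 * Returns 0 on success, EOF on a write error.
 * (put_unmetafied is an illustrative name, not the zsh function.)
 */
static int put_unmetafied(const char *s, FILE *f)
{
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;

        /* Undo the escape, unless META is the very last byte. */
        if (c == META && s[1] != '\0')
            c = (unsigned char)((unsigned char)*++s ^ 32);
        if (fputc(c, f) == EOF)
            return EOF;
    }
    return 0;
}
```

Passing the internal bytes 0xD7 0x83 0xBB through this function yields 0xD7 0x9B, whereas writing them with fputs() would emit all three bytes unchanged.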

Index: Src/builtin.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/builtin.c,v
retrieving revision 1.148
diff -u -r1.148 builtin.c
--- Src/builtin.c	9 Sep 2005 16:06:48 -0000	1.148
+++ Src/builtin.c	17 Sep 2005 18:09:24 -0000
@@ -699,7 +699,7 @@
 	else
 	    fmt = " ";
 	if (OPT_ISSET(ops,'l'))
-	    fputs(pwd, stdout);
+	    zputs(pwd, stdout);
 	else
 	    fprintdir(pwd, stdout);
 	for (node = firstnode(dirstack); node; incnode(node)) {
Index: Src/utils.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/utils.c,v
retrieving revision 1.89
diff -u -r1.89 utils.c
--- Src/utils.c	9 Sep 2005 20:34:42 -0000	1.89
+++ Src/utils.c	17 Sep 2005 18:10:08 -0000
@@ -146,7 +146,7 @@
 		putc('%', stderr);
 		break;
 	    case 'c':
-		fputs(nicechar(num), stderr);
+		zputs(nicechar(num), stderr);
 		break;
 	    case 'e':
 		/* print the corresponding message for this errno */
@@ -195,15 +195,21 @@
     return 0;
 }
 
-/* Turn a character into a visible representation thereof.  The visible *
- * string is put together in a static buffer, and this function returns *
- * a pointer to it.  Printable characters stand for themselves, DEL is  *
- * represented as "^?", newline and tab are represented as "\n" and     *
- * "\t", and normal control characters are represented in "^C" form.    *
- * Characters with bit 7 set, if unprintable, are represented as "\M-"  *
- * followed by the visible representation of the character with bit 7   *
- * stripped off.  Tokens are interpreted, rather than being treated as  *
- * literal characters.                                                  */
+/*
+ * Turn a character into a visible representation thereof.  The visible
+ * string is put together in a static buffer, and this function returns
+ * a pointer to it.  Printable characters stand for themselves, DEL is
+ * represented as "^?", newline and tab are represented as "\n" and
+ * "\t", and normal control characters are represented in "^C" form.
+ * Characters with bit 7 set, if unprintable, are represented as "\M-"
+ * followed by the visible representation of the character with bit 7
+ * stripped off.  Tokens are interpreted, rather than being treated as
+ * literal characters.
+ *
+ * Note that the returned string is metafied, so that it must be
+ * treated like any other zsh internal string (and not, for example,
+ * output directly).
+ */
 
 /**/
 mod_export char *
@@ -238,7 +244,17 @@
 	c += 0x40;
     }
     done:
-    *s++ = c;
+    /*
+     * The resulting string is still metafied, so check if
+     * we are returning a character in the range that needs metafication.
+     * This can't happen if the character is printed "nicely", so
+     * this results in a maximum of two bytes total (plus the null).
+     */
+    if (itok(c)) {
+	*s++ = Meta;
+	*s++ = c ^ 32;
+    } else
+	*s++ = c;
     *s = 0;
     return buf;
 }
@@ -292,7 +308,7 @@
 nicefputs(char *s, FILE *f)
 {
     for (; *s; s++)
-	fputs(nicechar(STOUC(*s)), f);
+	zputs(nicechar(STOUC(*s)), f);
 }
 #endif
 
@@ -3177,7 +3193,7 @@
 static char *
 nicedup(char const *s, int heap)
 {
-    int c, len = strlen(s) * 5;
+    int c, len = strlen(s) * 5 + 1;
     VARARR(char, buf, len);
     char *p = buf, *n;
 
@@ -3190,11 +3206,13 @@
 	}
 	if (c == Meta)
 	    c = *s++ ^ 32;
+	/* The result here is metafied */
 	n = nicechar(c);
 	while(*n)
 	    *p++ = *n++;
     }
-    return metafy(buf, p - buf, (heap ? META_HEAPDUP : META_DUP));
+    *p = '\0';
+    return heap ? dupstring(buf) : ztrdup(buf);
 }
 
 /**/
@@ -3228,7 +3246,7 @@
 	}
 	if (c == Meta)
 	    c = *s++ ^ 32;
-	if(fputs(nicechar(c), stream) < 0)
+	if(zputs(nicechar(c), stream) < 0)
 	    return EOF;
     }
     return 0;

-- 
Peter Stephenson <pws@pwstephenson.fsnet.co.uk>
Work: pws@csr.com
Web: http://www.pwstephenson.fsnet.co.uk



* Re: problem in prompt in utf-8
  2005-09-17 18:15 ` Peter Stephenson
@ 2005-09-17 21:33   ` Peter Stephenson
  2005-09-17 21:51     ` Zvi Har'El
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Stephenson @ 2005-09-17 21:33 UTC (permalink / raw)
  To: Zsh hackers list

Peter Stephenson wrote:
> This should fix the immediate problem, but note that the width of the
> prompt isn't calculated correctly yet: we don't scan prompts for
> multibyte characters.  Hence you might see oddities with the display
> since the shell doesn't know the position of the cursor after the
> prompt.  This is another thing on the list of fixes needed in zle.  (It
> should come under the "not rocket science" heading, unlike the
> completion code, so I hope it will be fixed relatively soon.)

Yeah.

I think this does the trick.  It relies on the fact that we usually print
out the prompt completely, so we don't need to convert it to a wide
character array, just count the characters in it.  We do this because
prompts can have zero-width characters such as terminal escapes.  There
was an optimisation that we could assume everything was hunky dory if
the width was the same as the length of the prompt, but I don't think
that works any more now I'm using wcwidth() for characters in the
prompt.

This may not be rocket science, but it's not trivial either, so there
could well be glitches.

The truncation code (stuff like "%12<...<") doesn't handle multibyte
characters properly yet.  Also, I didn't put wcwidth() anywhere other
than in the prompt width calculation, so characters in the editor
buffers are still assumed to have screen width 1.
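
The counting approach described above can be sketched in a self-contained way. The version below is not the patch: it swaps mbrtowc() and wcwidth() for a hand-written UTF-8 decoder and a toy width table so that it does not depend on the locale, it ignores the prompt's zero-width escape regions, and the names toy_width and count_columns are invented for this example. What it keeps is the shape of the loop: feed one byte at a time, test for tab and newline only between characters, add the width once a character is complete, and count one column for a trailing incomplete character.

```c
/*
 * Hypothetical stand-in for wcwidth(), covering only what the example
 * needs: combining marks U+0300..U+036F take no column, CJK ideographs
 * U+4E00..U+9FFF take two, everything else takes one.
 */
static int toy_width(unsigned long cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;
    if (cp >= 0x4E00 && cp <= 0x9FFF)
        return 2;
    return 1;
}

/*
 * Scan a UTF-8 string one byte at a time and report the column reached
 * on the last line (*wp) and the number of lines used (*hp) on a
 * terminal that is `columns` wide.
 */
static void count_columns(const char *s, int columns, int *wp, int *hp)
{
    int w = 0, h = 1;
    int pending = 0;        /* continuation bytes still expected */
    unsigned long cp = 0;   /* code point being assembled */

    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;

        if (w >= columns) {     /* earlier output filled the line */
            w = 0;
            h++;
        }
        if (pending) {
            if ((c & 0xC0) == 0x80) {
                cp = (cp << 6) | (c & 0x3F);
                if (--pending == 0)
                    w += toy_width(cp);
            } else {
                /* malformed: discard the partial character and this byte */
                pending = 0;
            }
            continue;
        }
        /* Tab and newline are only tested between characters. */
        if (c == '\t') {
            w = (w | 7) + 1;
        } else if (c == '\n') {
            w = 0;
            h++;
        } else if (c < 0x80) {
            w += toy_width(c);
        } else if ((c & 0xE0) == 0xC0) {
            cp = c & 0x1F;
            pending = 1;
        } else if ((c & 0xF0) == 0xE0) {
            cp = c & 0x0F;
            pending = 2;
        } else if ((c & 0xF8) == 0xF0) {
            cp = c & 0x07;
            pending = 3;
        }
        /* any other byte is not a valid lead byte: assume no output */
    }
    if (pending)
        w++;    /* incomplete character: assume one odd glyph */
    *wp = w;
    *hp = h;
}
```

Because the width is summed per character rather than per byte, a result equal to the byte length no longer implies a plain single-byte prompt once wide characters are possible, which is the reason given above for dropping the old shortcut.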

Index: Src/prompt.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/prompt.c,v
retrieving revision 1.23
diff -u -r1.23 prompt.c
--- Src/prompt.c	9 Sep 2004 10:12:47 -0000	1.23
+++ Src/prompt.c	17 Sep 2005 21:25:28 -0000
@@ -804,10 +804,15 @@
     return 0;
 }
 
-/* Count height etc. of a prompt string returned by promptexpand(). *
- * This depends on the current terminal width, and tabs and         *
- * newlines require nontrivial processing.                          *
- * Passing `overf' as -1 means to ignore columns (absolute width).  */
+/*
+ * Count height etc. of a prompt string returned by promptexpand().
+ * This depends on the current terminal width, and tabs and
+ * newlines require nontrivial processing.
+ * Passing `overf' as -1 means to ignore columns (absolute width).
+ *
+ * If multibyte is enabled, take account of multibyte characters
+ * by counting 1 for each.
+ */
 
 /**/
 mod_export void
@@ -815,29 +820,96 @@
 {
     int w = 0, h = 1;
     int s = 1;
-    for(; *str; str++) {
-	if(w >= columns && overf >= 0) {
+#ifdef ZLE_UNICODE_SUPPORT
+    int mbret, wcw, multi = 0;
+    char inchar;
+    mbstate_t mbs;
+    wchar_t wc;
+
+    memset(&mbs, 0, sizeof(mbs));
+#endif
+
+    for (; *str; str++) {
+	if (w >= columns && overf >= 0) {
 	    w = 0;
 	    h++;
 	}
-	if(*str == Meta)
-	    str++;
-	if(*str == Inpar)
+	/*
+	 * Input string should be metafied, so tokens in it should
+	 * be real tokens, even if there are multibyte characters.
+	 */
+	if (*str == Inpar)
 	    s = 0;
-	else if(*str == Outpar)
+	else if (*str == Outpar)
 	    s = 1;
-	else if(*str == Nularg)
+	else if (*str == Nularg)
 	    w++;
-	else if(s) {
-	    if(*str == '\t')
-		w = (w | 7) + 1;
-	    else if(*str == '\n') {
-		w = 0;
-		h++;
-	    } else
-		w++;
+	else if (s) {
+	    if (*str == Meta) {
+#ifdef ZLE_UNICODE_SUPPORT
+		inchar = *++str ^ 32;
+#else
+		str++;
+#endif
+	    } else {
+#ifdef ZLE_UNICODE_SUPPORT
+		/*
+		 * Don't look for tab or newline in the middle
+		 * of a multibyte character.  Otherwise, we are
+		 * relying on the character set being an extension
+		 * of ASCII so it's safe to test a single byte.
+		 */
+		if (!multi) {
+#endif
+		    if (*str == '\t') {
+			w = (w | 7) + 1;
+			continue;
+		    } else if (*str == '\n') {
+			w = 0;
+			h++;
+			continue;
+		    }
+#ifdef ZLE_UNICODE_SUPPORT
+		}
+
+		inchar = *str;
+#endif
+	    }
+
+#ifdef ZLE_UNICODE_SUPPORT
+	    mbret = mbrtowc(&wc, &inchar, 1, &mbs);
+	    if (mbret >= -1) {
+		if (mbret > 0) {
+		    /*
+		     * If the character isn't printable, this returns -1.
+		     */
+		    wcw = wcwidth(wc);
+		    if (wcw > 0)
+			w += wcw;
+		}
+		/*
+		 * else invalid character or possibly null: assume no
+		 * output
+		 */
+		multi = 0;
+	    } else {
+		/* else character is incomplete, keep looking. */
+		multi = 1;
+	    }
+#else
+	    w++;
+#endif
 	}
     }
+#ifdef ZLE_UNICODE_SUPPORT
+    if (multi) {
+	/*
+	 * oops: incomplete multibyte character.  assume we get a funny
+	 * glyph for single screen column.
+	 */
+	w++;
+    }
+#endif
     if(w >= columns && overf >= 0) {
 	if (!overf || w > columns) {
 	    w = 0;
@@ -901,12 +973,15 @@
 	countprompt(ptr, &w, 0, -1);
 	if (w > trunclen) {
 	    /*
-	     * We need to truncate.  t points to the truncation string -- *
-	     * which is inserted literally, without nice representation.  *
-	     * tlen is its length, and maxlen is the amount of the main	  *
-	     * string that we want to keep.  Note that if the truncation  *
-	     * string is longer than the truncation length (tlen >	  *
-	     * trunclen), the truncation string is used in full.	  *
+	     * We need to truncate.  t points to the truncation string --
+	     * which is inserted literally, without nice representation.
+	     * tlen is its length, and maxlen is the amount of the main
+	     * string that we want to keep.  Note that if the truncation
+	     * string is longer than the truncation length (tlen >
+	     * trunclen), the truncation string is used in full.
+	     *
+	     * TODO: we don't take account of multibyte characters
+	     * in the string we're truncating.
 	     */
 	    char *t = truncstr;
 	    int fullen = bp - ptr;
Index: Src/Zle/zle_refresh.c
===================================================================
RCS file: /cvsroot/zsh/zsh/Src/Zle/zle_refresh.c,v
retrieving revision 1.27
diff -u -r1.27 zle_refresh.c
--- Src/Zle/zle_refresh.c	15 Aug 2005 17:30:58 -0000	1.27
+++ Src/Zle/zle_refresh.c	17 Sep 2005 21:25:57 -0000
@@ -30,7 +30,13 @@
 #include "zle.mdh"
 #include "zle_refresh.pro"
 
-/* Expanded prompts */
+/*
+ * Expanded prompts.
+ *
+ * These are always output from the start, except in the special
+ * case where we are sure each character in the prompt corresponds
+ * to a character on screen.
+ */
 
 /**/
 char *lpromptbuf, *rpromptbuf;
@@ -202,7 +208,9 @@
 	}
     }
 
-    /* TODO currently zsh core is not using widechars */
+    /*
+     * countprompt() now correctly handles multibyte input.
+     */
     countprompt(lpromptbuf, &lpromptwof, &lprompth, 1);
     countprompt(rpromptbuf, &rpromptw, &rprompth, 0);
     if (lpromptwof != winw)
@@ -312,7 +320,11 @@
     oxtabs,			/* oxtabs - tabs expand to spaces if set    */
     numscrolls, onumscrolls;
 
-/* TODO currently it assumes sceenwidth 1 for every character */
+/*
+ * TODO currently it assumes sceenwidth 1 for every character
+ * (except for characters in the prompt which are correctly handled
+ * by wcwidth()).
+ */
 /**/
 mod_export void
 zrefresh(void)
@@ -449,7 +461,7 @@
         if (termflags & TERM_SHORT)
             vcs = 0;
         else if (!clearflag && lpromptbuf[0]) {
-            zputs(lpromptbuf, shout);	/* TODO convert to wide characters */
+            zputs(lpromptbuf, shout);
 	    if (lpromptwof == winw)
 		zputs("\n", shout);	/* works with both hasam and !hasam */
 	} else {
@@ -622,7 +634,6 @@
 	if (trashedzle && opts[TRANSIENTRPROMPT])
 	    put_rpmpt = 0;
 	else
-	    /* TODO (r)promptbuf will be widechar */
 	    put_rpmpt = rprompth == 1 && rpromptbuf[0] &&
 		!strchr(rpromptbuf, '\t') &&
 		(int)ZS_strlen(nbuf[0]) + rpromptw < winw - 1;
@@ -677,7 +688,6 @@
     /* output the right-prompt if appropriate */
 	if (put_rpmpt && !ln && !oput_rpmpt) {
 	    moveto(0, winw - 1 - rpromptw);
-	    /* TODO it will be wide char at some point */
 	    zputs(rpromptbuf, shout);
 	    vcs = winw - 1;
 	/* reset character attributes to that set by the main prompt */
@@ -1114,11 +1124,28 @@
 
 /* otherwise _carefully_ write the contents of the video buffer.
    if we're anywhere in the prompt, goto the left column and write the whole
-   prompt out unless ztrlen(lpromptbuf) == lpromptw : we can cheat then */
+   prompt out.
+
+   If strlen(lpromptbuf) == lpromptw, we can cheat and output
+   the appropriate chunk of the string.  This test relies on the
+   fact that any funny business will always make the length of
+   the string larger than the printing width, so if they're the same
+   we have only ASCII characters or a single-byte extension of ASCII.
+   Unfortunately this trick won't work if there are potentially
+   characters occupying more than one column.  We could flag that
+   this has happened (since it's not that common to have characters
+   wider than one column), but for now it's easier not to use the
+   trick if we are using wcwidth() on the prompt.  It's not that
+   common to be editing in the middle of the prompt anyway, I would
+   think.
+   */
     if (vln == 0 && i < lpromptw && !(termflags & TERM_SHORT)) {
+#ifndef ZLE_UNICODE_SUPPORT
 	if ((int)strlen(lpromptbuf) == lpromptw)
 	    fputs(lpromptbuf + i, shout);
-	else if (tccan(TCRIGHT) && (tclen[TCRIGHT] * ct <= ztrlen(lpromptbuf)))
+	else 
+#endif
+	if (tccan(TCRIGHT) && (tclen[TCRIGHT] * ct <= ztrlen(lpromptbuf)))
 	    /* it is cheaper to send TCRIGHT than reprint the whole prompt */
 	    for (ct = lpromptw - i; ct--; )
 		tcout(TCRIGHT);
@@ -1126,7 +1153,7 @@
 	    if (i != 0)
 		zputc('\r');
 	    tc_upcurs(lprompth - 1);
-	    zputs(lpromptbuf, shout); /* TODO wide character */
+	    zputs(lpromptbuf, shout);
 	    if (lpromptwof == winw)
 		zputs("\n", shout);	/* works with both hasam and !hasam */
 	}
@@ -1238,9 +1265,6 @@
     /*
      * Convert the entire lprompt so that we know how to count
      * characters.
-     *
-     * TODO screen widths are still not correct, indeed lpromptw knows
-     * nothing about multibyte characters so may be too long.
      */
     lpend = strchr(lpromptbuf, 0);
     /* Worst case number of characters, not null-terminated */
@@ -1258,6 +1282,7 @@
 	    /* dunno, try to recover */
 	    lpptr++;
 	    *lpwp++ = ZWC('?');
+	    memset(&ps, '\0', sizeof(ps));
 	}
     }
     if (lpwp - lpwbuf < lpromptw) {


* Re: problem in prompt in utf-8
  2005-09-17 21:33   ` Peter Stephenson
@ 2005-09-17 21:51     ` Zvi Har'El
  0 siblings, 0 replies; 6+ messages in thread
From: Zvi Har'El @ 2005-09-17 21:51 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

Thanks Peter, I installed your two patches, and they solve all the problems I
described in my email. They also solved a similar problem for other unicode
characters like single quote, U+2019, which I use for directory names, like
"Emperor’s New Clothes". Thanks again,

Zvi.

