bug in completion/expansion of files with LANG=C

zsh-workers
 help / color / mirror / code / Atom feed

* bug in completion/expansion of files with LANG=C
@ 2006-01-06 21:58 Wayne Davison
  2006-01-06 22:40 ` Peter Stephenson
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Wayne Davison @ 2006-01-06 21:58 UTC (permalink / raw)
  To: zsh-workers

I have a file I named hmm.\303\244 (which appears to be a lower-case a
with two dots over it in UTF-8).  However, if I have LANG=C, zsh refuses
to complete the filename in any form on my terminal -- it just stops
after the period (this is with MULTIBYTE_SUPPORT enabled).  If I set
LANG=en_US.UTF-8 (or even en_US.iso88591) the filename will start to
complete (though it displays wrong when being edited on my system that
only supports en_US.iso88591, but I expected that).  When I use a zsh
that has MULTIBYTE_SUPPORT disabled, the name completes and causes the
normal editing-output glitches, as you'd expect.

What should zsh do with characters that are outside the current
character set?  Display them as \M-* values?  A zsh without multibyte
support displays the name as hmm-\M-C\M-$ when being listed by the
completion system, but inserts the name into the command-line as literal
characters.  Perhaps a multibyte-enabled zsh should edit these illegal
characters on the command-line as 4-byte-wide \M-* values; and go back
to displaying them in completion lists that way too?

One more oddity I noticed:

Even on my Ubuntu Linux system that has en_US.UTF-8 and en_GB.UTF-8
support (at least, it lists those when completing "export LANG="),
editing a command-line that included the hmm.\303\244 name never worked
quite right with any setting of LANG.  For instance, using Ctrl-A to get
to the start of the line moved to the right position on the screen, but
pressing Ctrl-F to go forward output the wrong character in each
position (one character too far to the right).  I'm starting to debug
why this is happening...

..wayne..

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in completion/expansion of files with LANG=C
  2006-01-06 21:58 bug in completion/expansion of files with LANG=C Wayne Davison
@ 2006-01-06 22:40 ` Peter Stephenson
  2006-01-07  0:59   ` Wayne Davison
  2006-01-07  0:17 ` Wayne Davison
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Peter Stephenson @ 2006-01-06 22:40 UTC (permalink / raw)
  To: zsh-workers

On Fri, 6 Jan 2006 13:58:29 -0800
Wayne Davison <wayned@users.sourceforge.net> wrote:
> What should zsh do with characters that are outside the current
> character set?  Display them as \M-* values?  A zsh without multibyte
> support displays the name as hmm-\M-C\M-$ when being listed by the
> completion system, but inserts the name into the command-line as literal
> characters.  Perhaps a multibyte-enabled zsh should edit these illegal
> characters on the command-line as 4-byte-wide \M-* values; and go back
> to displaying them in completion lists that way too?

There are two difficulties that I can see at the moment.

First, and more fundamentally, we don't have any way of representing
invalid characters now that zle uses wchar_t internally (and that's not
going to change since it works very well).  If we can't convert a
multibyte character to a wchar_t we therefore can't do anything with it.
We would need to add a flag for each character position to indicate that
it contained, say, a byte-wide chunk of a character that we couldn't
convert.  That's a little hairy and messes up any attempt to convert a
complete wide string in one go.

Alternatively, we could convert the characters to a \M- or other
representation on input, but it doesn't help much since we still need
somehow to mark that a group of characters needs converting back to a
byte on output.

The big difference from the old single-byte code is that in that case we
knew that every byte could be treated in that fashion.  It's mixing the
two that's the difficulty.

Second, and less difficult, it's quite a big change to have characters
in the command line displayed differently from the way they naturally
output.  No doubt it could be done, perhaps by an extra stage of mapping
in zrefresh().  It would be quite helpful to have them with some
terminal effect, too.

Once that were done, it wouldn't be too hard to have different sorts of
mapping, so you could pick between a \M- representation and a hex code.

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page still at http://www.pwstephenson.fsnet.co.uk/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in completion/expansion of files with LANG=C
  2006-01-06 21:58 bug in completion/expansion of files with LANG=C Wayne Davison
  2006-01-06 22:40 ` Peter Stephenson
@ 2006-01-07  0:17 ` Wayne Davison
  2006-01-07 22:44 ` Wayne Davison
  2006-01-09  1:42 ` Wayne Davison
  3 siblings, 0 replies; 11+ messages in thread
From: Wayne Davison @ 2006-01-07  0:17 UTC (permalink / raw)
  To: zsh-workers

On Fri, Jan 06, 2006 at 01:58:29PM -0800, Wayne Davison wrote:
> For instance, using Ctrl-A to get to the start of the line moved to
> the right position on the screen, but pressing Ctrl-F to go forward
> output the wrong character in each position (one character too far to
> the right).

This turned out to be a bug in "screen" that occurs when screen got
started with LANG=C, and zsh got changed to use LANG=en_US.UTF-8.  Zsh
was actually outputting an escape sequence to move right one position,
but screen was optimizing it into a more efficient output sequence (the
outputting of a single literal character), and screen thought it was in
a different spot than it really was.  Fortunately, starting screen with
the same LANG value that you want to use inside works properly, so it's
not a really big screen bug, but it might trip up someone else who tries
to use zsh with a changed LANG value inside of screen in the future.

..wayne..

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in completion/expansion of files with LANG=C
  2006-01-06 22:40 ` Peter Stephenson
@ 2006-01-07  0:59   ` Wayne Davison
  0 siblings, 0 replies; 11+ messages in thread
From: Wayne Davison @ 2006-01-07  0:59 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: zsh-workers

On Fri, Jan 06, 2006 at 10:40:19PM +0000, Peter Stephenson wrote:
> If we can't convert a multibyte character to a wchar_t we therefore
> can't do anything with it.

One thing we could do would be to transform an invalid value into a
literal '?' -- that would at least match the filename, but it could be
dangerous.

Another thing we could do is to pick a legal wide-character value that
would be used to indicate that the real value for this character was
stored in a parallel (byte-sized) array.  For instance, choosing the
delete character would be one possibility (obviously making it necessary
to turn a real \177 char into a \177 value in the parallel array).  Yes,
this would complicate the code a good bit, but I think we need to do
something to make it possible for people to complete all filenames that
may be on their system.

> Second, and less difficult, it's quite a big change to have characters
> in the command line displayed differently from the way they naturally
> output.

No, there's already a precedence for this:  control characters.  To
handle this, zsh has a refresh string that is built up from the command-
line string, and adding one extra rule to the code in zrefresh() that
would make these invalid characters display as multiple physical
characters would actually be pretty easy.

> It would be quite helpful to have them with some terminal effect, too.

That would certainly be nice, and could be used to make the display of
literal control characters better.  The simplest way to add it would
probably be to extend the rparams structure to add some per-character
flags that zrefresh() would set and refreshline() would obey.

An alternate possibility would be to use the idiom "^123" to output the
octal value of the invalid byte.  (I chose '^' because it is already
special for displaying control characters, and ctrl-digit values aren't
already used to indicate anything.)

..wayne..

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in completion/expansion of files with LANG=C
  2006-01-06 21:58 bug in completion/expansion of files with LANG=C Wayne Davison
  2006-01-06 22:40 ` Peter Stephenson
  2006-01-07  0:17 ` Wayne Davison
@ 2006-01-07 22:44 ` Wayne Davison
  2006-01-08  5:56   ` Bart Schaefer
  2006-01-09  1:42 ` Wayne Davison
  3 siblings, 1 reply; 11+ messages in thread
From: Wayne Davison @ 2006-01-07 22:44 UTC (permalink / raw)
  To: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 521 bytes --]

Here's a patch to fix the first part of having non-convertable
characters found when completing filenames:  we now output the full
filename string instead of a truncated version of it.  I chose to just
call nicechar() to have it output a \M-x sequence (to make it the same
as the non-multibyte code).

However, I was wondering if it would be nicer to switch both the old and
the new code over to using a 3-digit octal value in such output?  E.g.
output "\303" instead of "\M-C", and "\201" instead of "\M-^A"?

..wayne..

[-- Attachment #2: mb_niceformat.patch --]
[-- Type: text/plain, Size: 931 bytes --]

--- Src/utils.c	15 Dec 2005 14:51:41 -0000	1.108
+++ Src/utils.c	7 Jan 2006 22:36:34 -0000
@@ -3475,22 +3475,21 @@ mb_niceformat(const char *s, FILE *strea
 
 	if (ret == (size_t)-1 || ret == (size_t)-2)
 	{
+	    /* The byte didn't convert, so output it as a \M-x sequence. */
+	    fmt = nicechar(*(unsigned char*)ptr);
+	    ret = newl = 1;
+	} else {
 	    /*
-	     * We're a bit stuck here.  I suppose we could
-	     * just stick with \M-... for the individual bytes.
+	     * careful in case converting NULL returned 0: NULLs are
+	     * real characters for us.
 	     */
-	    break;
+	    if (c == L'\0' && ret == 0)
+		ret = 1;
+	    fmt = wcs_nicechar(c, &newl, NULL);
 	}
-	/*
-	 * careful in case converting NULL returned 0: NULLs are real
-	 * characters for us.
-	 */
-	if (c == L'\0' && ret == 0)
-	    ret = 1;
+
 	umlen -= ret;
 	ptr += ret;
-
-	fmt = wcs_nicechar(c, &newl, NULL);
 	l += newl;
 
 	if (stream)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in completion/expansion of files with LANG=C
  2006-01-07 22:44 ` Wayne Davison
@ 2006-01-08  5:56   ` Bart Schaefer
  2006-01-08  8:06     ` Wayne Davison
  0 siblings, 1 reply; 11+ messages in thread
From: Bart Schaefer @ 2006-01-08  5:56 UTC (permalink / raw)
  To: zsh-workers

On Jan 7,  2:44pm, Wayne Davison wrote:
}
} However, I was wondering if it would be nicer to switch both the old and
} the new code over to using a 3-digit octal value in such output?  E.g.
} output "\303" instead of "\M-C", and "\201" instead of "\M-^A"?

I prefer the \M- form because it gives you some clue what you should do
to generate the equivalent value from the keyboard.  Similarly, I think
the output should continue to resemble that of "bindkey -L".

I wouldn't be hearbroken either way, though.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in completion/expansion of files with LANG=C
  2006-01-08  5:56   ` Bart Schaefer
@ 2006-01-08  8:06     ` Wayne Davison
  2006-01-08 18:03       ` Peter Stephenson
  0 siblings, 1 reply; 11+ messages in thread
From: Wayne Davison @ 2006-01-08  8:06 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 1448 bytes --]

On Sun, Jan 08, 2006 at 05:56:20AM +0000, Bart Schaefer wrote:
> I prefer the \M- form because it gives you some clue what you should do
> to generate the equivalent value from the keyboard.

Fair enough -- let's just leave it alone, then.

As for my patch in the grandparent email, I noticed some problems with
it:  the manpage for mbrtowc() says that the state of the mbstate_t
object is undefined after the function returns -1, so the code should
reset it to a known state.  When the function returns -2, it means the
code scanned to the end of the string without finding the end of a wide
character, so perhaps we should treat all the remaining characters as
invalid?  I'm not certain that's the correct thing to do, so I'll leave
the code handling -2 the same way as -1 for now.  Finally, I wasn't
setting the right visible width for the \M-... string (I had mistakenly
hardwired it to "1").

While twiddling these things I noticed a couple other things that I
think could be improved:

1. It looks to me like the code in wcs_nicechar() that calls
wcswidth(&c, 1) could really just call wcwidth(c), right?  If not,
what am I missing?

2. The code in mb_niceformat() calls strlen() on the "fmt" string
returned by wcs_nicechar(), but it seems to me that it could just use
the width that wcs_nicechar() returned, right?

Attached is an updated version of my patch that fixes the aforementioned
bugs and implements the 2 improvements.

..wayne..

[-- Attachment #2: mb_niceformat.patch --]
[-- Type: text/plain, Size: 1992 bytes --]

--- Src/utils.c	15 Dec 2005 14:51:41 -0000	1.108
+++ Src/utils.c	8 Jan 2006 07:55:56 -0000
@@ -375,7 +375,7 @@ wcs_nicechar(wchar_t c, size_t *widthp, 
     }

     if (widthp)
-	*widthp = (s - buf) + wcswidth(&c, 1);
+	*widthp = (s - buf) + wcwidth(c);
     if (swidep)
 	*swidep = s;
     for (mbptr = mbstr; ret; s++, mbptr++, ret--) {
@@ -3446,8 +3446,8 @@ niceztrlen(char const *s)
 mod_export size_t
 mb_niceformat(const char *s, FILE *stream, char **outstrp, int heap)
 {
-    size_t l = 0, newl, ret;
-    int umlen, outalloc, outleft;
+    size_t l = 0, outlen, outleft, ret;
+    int umlen, outalloc;
     wchar_t c;
     char *ums, *ptr, *fmt, *outstr, *outptr;
     mbstate_t ps;
@@ -3473,31 +3473,31 @@ mb_niceformat(const char *s, FILE *strea
     while (umlen > 0) {
 	ret = mbrtowc(&c, ptr, umlen, &ps);

-	if (ret == (size_t)-1 || ret == (size_t)-2)
-	{
-	    /*
-	     * We're a bit stuck here.  I suppose we could
-	     * just stick with \M-... for the individual bytes.
-	     */
-	    break;
-	}
-	/*
-	 * careful in case converting NULL returned 0: NULLs are real
-	 * characters for us.
-	 */
-	if (c == L'\0' && ret == 0)
+	if (ret != (size_t)-1 && ret != (size_t)-2) {
+	    /* Careful:  converting '\0' returns 0, but a '\0' is a
+	     * real character for us, so we should consume 1 byte. */
+	    if (c == L'\0')
+		ret = 1;
+
+	    fmt = wcs_nicechar(c, &outlen, NULL);
+	} else {
+	    /* Get ps out of its undefined state. */
+	    memset(&ps, 0, sizeof ps);
 	    ret = 1;
+
+	    /* The byte didn't convert, so output it as a \M-... sequence. */
+	    fmt = nicechar(*(unsigned char*)ptr);
+	    outlen = strlen(fmt);
+	}
+
 	umlen -= ret;
 	ptr += ret;
-
-	fmt = wcs_nicechar(c, &newl, NULL);
-	l += newl;
+	l += outlen;

 	if (stream)
 	    zputs(fmt, stream);
 	if (outstr) {
 	    /* Append to output string */
-	    int outlen = strlen(fmt);
 	    if (outlen >= outleft) {
 		/* Reallocate to twice the length */
 		int outoffset = outptr - outstr;

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in completion/expansion of files with LANG=C
  2006-01-08  8:06     ` Wayne Davison
@ 2006-01-08 18:03       ` Peter Stephenson
  2006-01-08 23:16         ` Wayne Davison
  2006-01-12  1:26         ` Wayne Davison
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Stephenson @ 2006-01-08 18:03 UTC (permalink / raw)
  To: zsh-workers

On Sun, 8 Jan 2006 00:06:21 -0800
Wayne Davison <wayned@users.sourceforge.net> wrote:
> As for my patch in the grandparent email, I noticed some problems with
> it:  the manpage for mbrtowc() says that the state of the mbstate_t
> object is undefined after the function returns -1, so the code should
> reset it to a known state.  When the function returns -2, it means the
> code scanned to the end of the string without finding the end of a wide
> character, so perhaps we should treat all the remaining characters as
> invalid?

You mean output everything remaining in the string as special codes
rather than real (multibyte) characters?  Yes, that would make sense.

> 1. It looks to me like the code in wcs_nicechar() that calls
> wcswidth(&c, 1) could really just call wcwidth(c), right?  If not,
> what am I missing?

Yes, and it's the only occurrence of wcswidth(), so it would make sense
to remove it.  This would make my patch for probing for wcswidth()
redundant.

> 2. The code in mb_niceformat() calls strlen() on the "fmt" string
> returned by wcs_nicechar(), but it seems to me that it could just use
> the width that wcs_nicechar() returned, right?

I think it really needs the length of the string here.  The width
produced by wcs_nicechar() is a printing width, which isn't the same.

-- 
Peter Stephenson <p.w.stephenson@ntlworld.com>
Web page still at http://www.pwstephenson.fsnet.co.uk/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in completion/expansion of files with LANG=C
  2006-01-08 18:03       ` Peter Stephenson
@ 2006-01-08 23:16         ` Wayne Davison
  2006-01-12  1:26         ` Wayne Davison
  1 sibling, 0 replies; 11+ messages in thread
From: Wayne Davison @ 2006-01-08 23:16 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: zsh-workers

On Sun, Jan 08, 2006 at 06:03:09PM +0000, Peter Stephenson wrote:
> You mean output everything remaining in the string as special codes
> rather than real (multibyte) characters?  Yes, that would make sense.

I didn't code that up yet, but I did commit an improved version of
mb_niceformat() that avoids truncating a name at the first erroneous
character (taking into account your other feedback on my prior patch).

> This would make my patch for probing for wcswidth() redundant.

Cool -- I was hoping that would be the case.  I got rid of all the code
that used wcswidth().

> I think it really needs the length of the string here.  The width
> produced by wcs_nicechar() is a printing width, which isn't the same.

I see that I misread what the returned width was measuring.  Thanks!

..wayne..

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in completion/expansion of files with LANG=C
  2006-01-06 21:58 bug in completion/expansion of files with LANG=C Wayne Davison
                   ` (2 preceding siblings ...)
  2006-01-07 22:44 ` Wayne Davison
@ 2006-01-09  1:42 ` Wayne Davison
  3 siblings, 0 replies; 11+ messages in thread
From: Wayne Davison @ 2006-01-09  1:42 UTC (permalink / raw)
  To: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 1446 bytes --]

I looked around at the other calls to mbrtowc() in the code, and cleaned
up a few things:

1. I made all the code assign the return value to a size_t, not an int.
This should prevent a failure on a system where the size of an int is
larger than the size of a size_t (since the (size_t)-1 and (size_t)-2
values won't get converted into negative numbers if that is the case).

2. I added STOUC() around a couple char args that were getting passed to
nicechar() when mbrtowc() failed.

3. One of the calls needed to reset the mbstate_t object when continuing
to parse the string after mbrtowc() failed.

4. The code in sub_match() (in Src/Zle/compmatch.c) had a bug when it
assembled a wide-char value from multiple bytes (decoded from a metafied
string):  the code was not advancing past all the raw values used if
there was a meta char or a multibyte character sequence (and I think a
'\0' byte might have even looped infinitely).

After doing all that, it was time to begin looking at the next stage of
the non-inputable-filename problem: I decided to do the easiest possible
change first, so attached is a patch that causes zsh to insert a literal
question-mark in place of each errant character.  This allows me to at
least match a name that cannot be input into the command-line, but it
could be a little dangerous if it happens to match other filenames too,
so this is probably not something we'd want to use in an actual release.

..wayne..

[-- Attachment #2: questionable.patch --]
[-- Type: text/plain, Size: 771 bytes --]

--- Src/Zle/zle_utils.c	9 Jan 2006 00:29:57 -0000	1.34
+++ Src/Zle/zle_utils.c	9 Jan 2006 01:20:35 -0000
@@ -277,13 +277,13 @@ stringaszleline(char *instr, int incs, i
 	while (ll > 0) {
 	    size_t ret = mbrtowc(outptr, inptr, ll, &ps);

-	    /*
-	     * At this point we don't handle either incomplete (-2) or
-	     * invalid (-1) multibyte sequences.  Use the current length
-	     * and return.
-	     */
-	    if (ret == (size_t)-1 || ret == (size_t)-2)
-		break;
+	    if (ret == (size_t)-1 || ret == (size_t)-2) {
+		/* Transform invalid character sequences into literal
+		 * question marks, at least for now... */
+		*outptr = L'?';
+		ret = 1;
+		memset(&ps, '\0', sizeof ps);
+	    }

 	    /*
 	     * Careful: converting a wide NUL returns zero, but we

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: bug in completion/expansion of files with LANG=C
  2006-01-08 18:03       ` Peter Stephenson
  2006-01-08 23:16         ` Wayne Davison
@ 2006-01-12  1:26         ` Wayne Davison
  1 sibling, 0 replies; 11+ messages in thread
From: Wayne Davison @ 2006-01-12  1:26 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: zsh-workers

On Sun, Jan 08, 2006 at 06:03:09PM +0000, Peter Stephenson wrote:
> You mean output everything remaining in the string as special codes
> rather than real (multibyte) characters?  Yes, that would make sense.

OK, I've made the code do this in all the instances where a -2 means
that it scanned clear to the end of the string (not when we're
converting a single input byte at a time).

Some other changes that I made:

 - A few more places in the code now treat the size_t return value from
   mbrtowc() and mbsrtowcs() as a size_t.

 - A spot in utils.c was treating the return of wctomb() as a size_t
   instead of an int.

 - Defined MB_INCOMPLETE & MB_INVALID to be used instead of some literal
   (size_t)-2 and (size_t)-1 values.

 - Tweaked some variable names to make them more consistent (e.g. we use
   "mbs" for the multibyte state everywhere instead of sometimes mbs,
   and sometimes ps).

 - Found a couple more places that needed to reset "mbs" on error.

A diff of what I changed can be found here for those who wish to see it:

    http://opencoder.net/zsh-multibyte.diff

..wayne..


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2006-01-12  1:27 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-06 21:58 bug in completion/expansion of files with LANG=C Wayne Davison
2006-01-06 22:40 ` Peter Stephenson
2006-01-07  0:59   ` Wayne Davison
2006-01-07  0:17 ` Wayne Davison
2006-01-07 22:44 ` Wayne Davison
2006-01-08  5:56   ` Bart Schaefer
2006-01-08  8:06     ` Wayne Davison
2006-01-08 18:03       ` Peter Stephenson
2006-01-08 23:16         ` Wayne Davison
2006-01-12  1:26         ` Wayne Davison
2006-01-09  1:42 ` Wayne Davison

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).