* bug in completion/expansion of files with LANG=C @ 2006-01-06 21:58 Wayne Davison 2006-01-06 22:40 ` Peter Stephenson ` (3 more replies) 0 siblings, 4 replies; 11+ messages in thread From: Wayne Davison @ 2006-01-06 21:58 UTC (permalink / raw) To: zsh-workers I have a file I named hmm.\303\244 (which appears to be a lower-case a with two dots over it in UTF-8). However, if I have LANG=C, zsh refuses to complete the filename in any form on my terminal -- it just stops after the period (this is with MULTIBYTE_SUPPORT enabled). If I set LANG=en_US.UTF-8 (or even en_US.iso88591) the filename will start to complete (though it displays wrong when being edited on my system that only supports en_US.iso88591, but I expected that). When I use a zsh that has MULTIBYTE_SUPPORT disabled, the name completes and causes the normal editing-output glitches, as you'd expect. What should zsh do with characters that are outside the current character set? Display them as \M-* values? A zsh without multibyte support displays the name as hmm-\M-C\M-$ when being listed by the completion system, but inserts the name into the command-line as literal characters. Perhaps a multibyte-enabled zsh should edit these illegal characters on the command-line as 4-byte-wide \M-* values; and go back to displaying them in completion lists that way too? One more oddity I noticed: Even on my Ubuntu Linux system that has en_US.UTF-8 and en_GB.UTF-8 support (at least, it lists those when completing "export LANG="), editing a command-line that included the hmm.\303\244 name never worked quite right with any setting of LANG. For instance, using Ctrl-A to get to the start of the line moved to the right position on the screen, but pressing Ctrl-F to go forward output the wrong character in each position (one character too far to the right). I'm starting to debug why this is happening... ..wayne.. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: bug in completion/expansion of files with LANG=C 2006-01-06 21:58 bug in completion/expansion of files with LANG=C Wayne Davison @ 2006-01-06 22:40 ` Peter Stephenson 2006-01-07 0:59 ` Wayne Davison 2006-01-07 0:17 ` Wayne Davison ` (2 subsequent siblings) 3 siblings, 1 reply; 11+ messages in thread From: Peter Stephenson @ 2006-01-06 22:40 UTC (permalink / raw) To: zsh-workers On Fri, 6 Jan 2006 13:58:29 -0800 Wayne Davison <wayned@users.sourceforge.net> wrote: > What should zsh do with characters that are outside the current > character set? Display them as \M-* values? A zsh without multibyte > support displays the name as hmm-\M-C\M-$ when being listed by the > completion system, but inserts the name into the command-line as literal > characters. Perhaps a multibyte-enabled zsh should edit these illegal > characters on the command-line as 4-byte-wide \M-* values; and go back > to displaying them in completion lists that way too? There are two difficulties that I can see at the moment. First, and more fundamentally, we don't have any way of representing invalid characters now that zle uses wchar_t internally (and that's not going to change since it works very well). If we can't convert a multibyte character to a wchar_t we therefore can't do anything with it. We would need to add a flag for each character position to indicate that it contained, say, a byte-wide chunk of a character that we couldn't convert. That's a little hairy and messes up any attempt to convert a complete wide string in one go. Alternatively, we could convert the characters to a \M- or other representation on input, but it doesn't help much since we still need somehow to mark that a group of characters needs converting back to a byte on output. The big difference from the old single-byte code is that in that case we knew that every byte could be treated in that fashion. It's mixing the two that's the difficulty. Second, and less difficult, it's quite a big change to have characters in the command line displayed differently from the way they naturally output. No doubt it could be done, perhaps by an extra stage of mapping in zrefresh(). It would be quite helpful to have them with some terminal effect, too. Once that were done, it wouldn't be too hard to have different sorts of mapping, so you could pick between a \M- representation and a hex code. -- Peter Stephenson <p.w.stephenson@ntlworld.com> Web page still at http://www.pwstephenson.fsnet.co.uk/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: bug in completion/expansion of files with LANG=C 2006-01-06 22:40 ` Peter Stephenson @ 2006-01-07 0:59 ` Wayne Davison 0 siblings, 0 replies; 11+ messages in thread From: Wayne Davison @ 2006-01-07 0:59 UTC (permalink / raw) To: Peter Stephenson; +Cc: zsh-workers On Fri, Jan 06, 2006 at 10:40:19PM +0000, Peter Stephenson wrote: > If we can't convert a multibyte character to a wchar_t we therefore > can't do anything with it. One thing we could do would be to transform an invalid value into a literal '?' -- that would at least match the filename, but it could be dangerous. Another thing we could do is to pick a legal wide-character value that would be used to indicate that the real value for this character was stored in a parallel (byte-sized) array. For instance, choosing the delete character would be one possibility (obviously making it necessary to turn a real \177 char into a \177 value in the parallel array). Yes, this would complicate the code a good bit, but I think we need to do something to make it possible for people to complete all filenames that may be on their system. > Second, and less difficult, it's quite a big change to have characters > in the command line displayed differently from the way they naturally > output. No, there's already a precedence for this: control characters. To handle this, zsh has a refresh string that is built up from the command- line string, and adding one extra rule to the code in zrefresh() that would make these invalid characters display as multiple physical characters would actually be pretty easy. > It would be quite helpful to have them with some terminal effect, too. That would certainly be nice, and could be used to make the display of literal control characters better. The simplest way to add it would probably be to extend the rparams structure to add some per-character flags that zrefresh() would set and refreshline() would obey. An alternate possibility would be to use the idiom "^123" to output the octal value of the invalid byte. (I chose '^' because it is already special for displaying control characters, and ctrl-digit values aren't already used to indicate anything.) ..wayne.. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: bug in completion/expansion of files with LANG=C 2006-01-06 21:58 bug in completion/expansion of files with LANG=C Wayne Davison 2006-01-06 22:40 ` Peter Stephenson @ 2006-01-07 0:17 ` Wayne Davison 2006-01-07 22:44 ` Wayne Davison 2006-01-09 1:42 ` Wayne Davison 3 siblings, 0 replies; 11+ messages in thread From: Wayne Davison @ 2006-01-07 0:17 UTC (permalink / raw) To: zsh-workers On Fri, Jan 06, 2006 at 01:58:29PM -0800, Wayne Davison wrote: > For instance, using Ctrl-A to get to the start of the line moved to > the right position on the screen, but pressing Ctrl-F to go forward > output the wrong character in each position (one character too far to > the right). This turned out to be a bug in "screen" that occurs when screen got started with LANG=C, and zsh got changed to use LANG=en_US.UTF-8. Zsh was actually outputting an escape sequence to move right one position, but screen was optimizing it into a more efficient output sequence (the outputting of a single literal character), and screen thought it was in a different spot than it really was. Fortunately, starting screen with the same LANG value that you want to use inside works properly, so it's not a really big screen bug, but it might trip up someone else who tries to use zsh with a changed LANG value inside of screen in the future. ..wayne.. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: bug in completion/expansion of files with LANG=C 2006-01-06 21:58 bug in completion/expansion of files with LANG=C Wayne Davison 2006-01-06 22:40 ` Peter Stephenson 2006-01-07 0:17 ` Wayne Davison @ 2006-01-07 22:44 ` Wayne Davison 2006-01-08 5:56 ` Bart Schaefer 2006-01-09 1:42 ` Wayne Davison 3 siblings, 1 reply; 11+ messages in thread From: Wayne Davison @ 2006-01-07 22:44 UTC (permalink / raw) To: zsh-workers [-- Attachment #1: Type: text/plain, Size: 521 bytes --] Here's a patch to fix the first part of having non-convertable characters found when completing filenames: we now output the full filename string instead of a truncated version of it. I chose to just call nicechar() to have it output a \M-x sequence (to make it the same as the non-multibyte code). However, I was wondering if it would be nicer to switch both the old and the new code over to using a 3-digit octal value in such output? E.g. output "\303" instead of "\M-C", and "\201" instead of "\M-^A"? ..wayne.. [-- Attachment #2: mb_niceformat.patch --] [-- Type: text/plain, Size: 931 bytes --] --- Src/utils.c 15 Dec 2005 14:51:41 -0000 1.108 +++ Src/utils.c 7 Jan 2006 22:36:34 -0000 @@ -3475,22 +3475,21 @@ mb_niceformat(const char *s, FILE *strea if (ret == (size_t)-1 || ret == (size_t)-2) { + /* The byte didn't convert, so output it as a \M-x sequence. */ + fmt = nicechar(*(unsigned char*)ptr); + ret = newl = 1; + } else { /* - * We're a bit stuck here. I suppose we could - * just stick with \M-... for the individual bytes. + * careful in case converting NULL returned 0: NULLs are + * real characters for us. */ - break; + if (c == L'\0' && ret == 0) + ret = 1; + fmt = wcs_nicechar(c, &newl, NULL); } - /* - * careful in case converting NULL returned 0: NULLs are real - * characters for us. - */ - if (c == L'\0' && ret == 0) - ret = 1; + umlen -= ret; ptr += ret; - - fmt = wcs_nicechar(c, &newl, NULL); l += newl; if (stream) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: bug in completion/expansion of files with LANG=C 2006-01-07 22:44 ` Wayne Davison @ 2006-01-08 5:56 ` Bart Schaefer 2006-01-08 8:06 ` Wayne Davison 0 siblings, 1 reply; 11+ messages in thread From: Bart Schaefer @ 2006-01-08 5:56 UTC (permalink / raw) To: zsh-workers On Jan 7, 2:44pm, Wayne Davison wrote: } } However, I was wondering if it would be nicer to switch both the old and } the new code over to using a 3-digit octal value in such output? E.g. } output "\303" instead of "\M-C", and "\201" instead of "\M-^A"? I prefer the \M- form because it gives you some clue what you should do to generate the equivalent value from the keyboard. Similarly, I think the output should continue to resemble that of "bindkey -L". I wouldn't be hearbroken either way, though. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: bug in completion/expansion of files with LANG=C 2006-01-08 5:56 ` Bart Schaefer @ 2006-01-08 8:06 ` Wayne Davison 2006-01-08 18:03 ` Peter Stephenson 0 siblings, 1 reply; 11+ messages in thread From: Wayne Davison @ 2006-01-08 8:06 UTC (permalink / raw) To: Bart Schaefer; +Cc: zsh-workers [-- Attachment #1: Type: text/plain, Size: 1448 bytes --] On Sun, Jan 08, 2006 at 05:56:20AM +0000, Bart Schaefer wrote: > I prefer the \M- form because it gives you some clue what you should do > to generate the equivalent value from the keyboard. Fair enough -- let's just leave it alone, then. As for my patch in the grandparent email, I noticed some problems with it: the manpage for mbrtowc() says that the state of the mbstate_t object is undefined after the function returns -1, so the code should reset it to a known state. When the function returns -2, it means the code scanned to the end of the string without finding the end of a wide character, so perhaps we should treat all the remaining characters as invalid? I'm not certain that's the correct thing to do, so I'll leave the code handling -2 the same way as -1 for now. Finally, I wasn't setting the right visible width for the \M-... string (I had mistakenly hardwired it to "1"). While twiddling these things I noticed a couple other things that I think could be improved: 1. It looks to me like the code in wcs_nicechar() that calls wcswidth(&c, 1) could really just call wcwidth(c), right? If not, what am I missing? 2. The code in mb_niceformat() calls strlen() on the "fmt" string returned by wcs_nicechar(), but it seems to me that it could just use the width that wcs_nicechar() returned, right? Attached is an updated version of my patch that fixes the aforementioned bugs and implements the 2 improvements. ..wayne.. [-- Attachment #2: mb_niceformat.patch --] [-- Type: text/plain, Size: 1992 bytes --] --- Src/utils.c 15 Dec 2005 14:51:41 -0000 1.108 +++ Src/utils.c 8 Jan 2006 07:55:56 -0000 @@ -375,7 +375,7 @@ wcs_nicechar(wchar_t c, size_t *widthp, } if (widthp) - *widthp = (s - buf) + wcswidth(&c, 1); + *widthp = (s - buf) + wcwidth(c); if (swidep) *swidep = s; for (mbptr = mbstr; ret; s++, mbptr++, ret--) { @@ -3446,8 +3446,8 @@ niceztrlen(char const *s) mod_export size_t mb_niceformat(const char *s, FILE *stream, char **outstrp, int heap) { - size_t l = 0, newl, ret; - int umlen, outalloc, outleft; + size_t l = 0, outlen, outleft, ret; + int umlen, outalloc; wchar_t c; char *ums, *ptr, *fmt, *outstr, *outptr; mbstate_t ps; @@ -3473,31 +3473,31 @@ mb_niceformat(const char *s, FILE *strea while (umlen > 0) { ret = mbrtowc(&c, ptr, umlen, &ps); - if (ret == (size_t)-1 || ret == (size_t)-2) - { - /* - * We're a bit stuck here. I suppose we could - * just stick with \M-... for the individual bytes. - */ - break; - } - /* - * careful in case converting NULL returned 0: NULLs are real - * characters for us. - */ - if (c == L'\0' && ret == 0) + if (ret != (size_t)-1 && ret != (size_t)-2) { + /* Careful: converting '\0' returns 0, but a '\0' is a + * real character for us, so we should consume 1 byte. */ + if (c == L'\0') + ret = 1; + + fmt = wcs_nicechar(c, &outlen, NULL); + } else { + /* Get ps out of its undefined state. */ + memset(&ps, 0, sizeof ps); ret = 1; + + /* The byte didn't convert, so output it as a \M-... sequence. */ + fmt = nicechar(*(unsigned char*)ptr); + outlen = strlen(fmt); + } + umlen -= ret; ptr += ret; - - fmt = wcs_nicechar(c, &newl, NULL); - l += newl; + l += outlen; if (stream) zputs(fmt, stream); if (outstr) { /* Append to output string */ - int outlen = strlen(fmt); if (outlen >= outleft) { /* Reallocate to twice the length */ int outoffset = outptr - outstr; ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: bug in completion/expansion of files with LANG=C 2006-01-08 8:06 ` Wayne Davison @ 2006-01-08 18:03 ` Peter Stephenson 2006-01-08 23:16 ` Wayne Davison 2006-01-12 1:26 ` Wayne Davison 0 siblings, 2 replies; 11+ messages in thread From: Peter Stephenson @ 2006-01-08 18:03 UTC (permalink / raw) To: zsh-workers On Sun, 8 Jan 2006 00:06:21 -0800 Wayne Davison <wayned@users.sourceforge.net> wrote: > As for my patch in the grandparent email, I noticed some problems with > it: the manpage for mbrtowc() says that the state of the mbstate_t > object is undefined after the function returns -1, so the code should > reset it to a known state. When the function returns -2, it means the > code scanned to the end of the string without finding the end of a wide > character, so perhaps we should treat all the remaining characters as > invalid? You mean output everything remaining in the string as special codes rather than real (multibyte) characters? Yes, that would make sense. > 1. It looks to me like the code in wcs_nicechar() that calls > wcswidth(&c, 1) could really just call wcwidth(c), right? If not, > what am I missing? Yes, and it's the only occurrence of wcswidth(), so it would make sense to remove it. This would make my patch for probing for wcswidth() redundant. > 2. The code in mb_niceformat() calls strlen() on the "fmt" string > returned by wcs_nicechar(), but it seems to me that it could just use > the width that wcs_nicechar() returned, right? I think it really needs the length of the string here. The width produced by wcs_nicechar() is a printing width, which isn't the same. -- Peter Stephenson <p.w.stephenson@ntlworld.com> Web page still at http://www.pwstephenson.fsnet.co.uk/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: bug in completion/expansion of files with LANG=C 2006-01-08 18:03 ` Peter Stephenson @ 2006-01-08 23:16 ` Wayne Davison 2006-01-12 1:26 ` Wayne Davison 1 sibling, 0 replies; 11+ messages in thread From: Wayne Davison @ 2006-01-08 23:16 UTC (permalink / raw) To: Peter Stephenson; +Cc: zsh-workers On Sun, Jan 08, 2006 at 06:03:09PM +0000, Peter Stephenson wrote: > You mean output everything remaining in the string as special codes > rather than real (multibyte) characters? Yes, that would make sense. I didn't code that up yet, but I did commit an improved version of mb_niceformat() that avoids truncating a name at the first erroneous character (taking into account your other feedback on my prior patch). > This would make my patch for probing for wcswidth() redundant. Cool -- I was hoping that would be the case. I got rid of all the code that used wcswidth(). > I think it really needs the length of the string here. The width > produced by wcs_nicechar() is a printing width, which isn't the same. I see that I misread what the returned width was measuring. Thanks! ..wayne.. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: bug in completion/expansion of files with LANG=C 2006-01-08 18:03 ` Peter Stephenson 2006-01-08 23:16 ` Wayne Davison @ 2006-01-12 1:26 ` Wayne Davison 1 sibling, 0 replies; 11+ messages in thread From: Wayne Davison @ 2006-01-12 1:26 UTC (permalink / raw) To: Peter Stephenson; +Cc: zsh-workers On Sun, Jan 08, 2006 at 06:03:09PM +0000, Peter Stephenson wrote: > You mean output everything remaining in the string as special codes > rather than real (multibyte) characters? Yes, that would make sense. OK, I've made the code do this in all the instances where a -2 means that it scanned clear to the end of the string (not when we're converting a single input byte at a time). Some other changes that I made: - A few more places in the code now treat the size_t return value from mbrtowc() and mbsrtowcs() as a size_t. - A spot in utils.c was treating the return of wctomb() as a size_t instead of an int. - Defined MB_INCOMPLETE & MB_INVALID to be used instead of some literal (size_t)-2 and (size_t)-1 values. - Tweaked some variable names to make them more consistent (e.g. we use "mbs" for the multibyte state everywhere instead of sometimes mbs, and sometimes ps). - Found a couple more places that needed to reset "mbs" on error. A diff of what I changed can be found here for those who wish to see it: http://opencoder.net/zsh-multibyte.diff ..wayne.. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: bug in completion/expansion of files with LANG=C 2006-01-06 21:58 bug in completion/expansion of files with LANG=C Wayne Davison ` (2 preceding siblings ...) 2006-01-07 22:44 ` Wayne Davison @ 2006-01-09 1:42 ` Wayne Davison 3 siblings, 0 replies; 11+ messages in thread From: Wayne Davison @ 2006-01-09 1:42 UTC (permalink / raw) To: zsh-workers [-- Attachment #1: Type: text/plain, Size: 1446 bytes --] I looked around at the other calls to mbrtowc() in the code, and cleaned up a few things: 1. I made all the code assign the return value to a size_t, not an int. This should prevent a failure on a system where the size of an int is larger than the size of a size_t (since the (size_t)-1 and (size_t)-2 values won't get converted into negative numbers if that is the case). 2. I added STOUC() around a couple char args that were getting passed to nicechar() when mbrtowc() failed. 3. One of the calls needed to reset the mbstate_t object when continuing to parse the string after mbrtowc() failed. 4. The code in sub_match() (in Src/Zle/compmatch.c) had a bug when it assembled a wide-char value from multiple bytes (decoded from a metafied string): the code was not advancing past all the raw values used if there was a meta char or a multibyte character sequence (and I think a '\0' byte might have even looped infinitely). After doing all that, it was time to begin looking at the next stage of the non-inputable-filename problem: I decided to do the easiest possible change first, so attached is a patch that causes zsh to insert a literal question-mark in place of each errant character. This allows me to at least match a name that cannot be input into the command-line, but it could be a little dangerous if it happens to match other filenames too, so this is probably not something we'd want to use in an actual release. ..wayne.. [-- Attachment #2: questionable.patch --] [-- Type: text/plain, Size: 771 bytes --] --- Src/Zle/zle_utils.c 9 Jan 2006 00:29:57 -0000 1.34 +++ Src/Zle/zle_utils.c 9 Jan 2006 01:20:35 -0000 @@ -277,13 +277,13 @@ stringaszleline(char *instr, int incs, i while (ll > 0) { size_t ret = mbrtowc(outptr, inptr, ll, &ps); - /* - * At this point we don't handle either incomplete (-2) or - * invalid (-1) multibyte sequences. Use the current length - * and return. - */ - if (ret == (size_t)-1 || ret == (size_t)-2) - break; + if (ret == (size_t)-1 || ret == (size_t)-2) { + /* Transform invalid character sequences into literal + * question marks, at least for now... */ + *outptr = L'?'; + ret = 1; + memset(&ps, '\0', sizeof ps); + } /* * Careful: converting a wide NUL returns zero, but we ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2006-01-12 1:27 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2006-01-06 21:58 bug in completion/expansion of files with LANG=C Wayne Davison 2006-01-06 22:40 ` Peter Stephenson 2006-01-07 0:59 ` Wayne Davison 2006-01-07 0:17 ` Wayne Davison 2006-01-07 22:44 ` Wayne Davison 2006-01-08 5:56 ` Bart Schaefer 2006-01-08 8:06 ` Wayne Davison 2006-01-08 18:03 ` Peter Stephenson 2006-01-08 23:16 ` Wayne Davison 2006-01-12 1:26 ` Wayne Davison 2006-01-09 1:42 ` Wayne Davison
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/zsh/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).