* zsh doesn't understand some multibyte characters
From: Danek Duvall @ 2015-05-13 16:14 UTC
To: zsh-users

Perhaps this is just on Solaris, I dunno.  But for some multibyte
characters ("…", for instance), if I type them on the commandline -- either
using the Compose key on my keyboard, or via insert-unicode-character or
insert-composed-character -- then if I move the cursor back over them or
delete back over them, zsh gets confused and moves two positions instead of
one:

    $ PS1='$ '
    $ …_     (cursor at _; now hit backspace once)
    $_

I do have access to a Linux box running 4.3.17, and it doesn't seem to be a
problem there, but I don't know whether that's because of a different
version or a different OS.

I'll note that the same thing happens with all the other shells on Solaris,
so I'm guessing it's not directly a problem with zsh.  FWIW, bash (and
libreadline) is compiled with Solaris curses, while zsh is compiled with
ncurses, so I'm guessing it's not that.  But vim has no problems whatsoever
(old Solaris vi does).

Where else should I be looking for the problem?

Thanks,
Danek

* Re: zsh doesn't understand some multibyte characters
From: Bart Schaefer @ 2015-05-13 17:43 UTC
To: Danek Duvall, zsh-users

On May 13,  9:14am, Danek Duvall wrote:
} Subject: zsh doesn't understand some multibyte characters
}
} Perhaps this is just on Solaris, I dunno.  But for some multibyte
} characters [...] if I move the cursor back over them or delete back
} over them, zsh gets confused and moves two positions instead of one
}
} I'll note that the same thing happens with all the other shells on
} Solaris [... ] Where else should I be looking for the problem?

This sounds like the WCWIDTH() macro or function is returning the wrong
value for some characters.

If you are compiling your own zsh, can you (a) check whether config.h
defines BROKEN_WCWIDTH, and (b) if it does not, try defining it and
recompile to see if that makes any difference?

* Re: zsh doesn't understand some multibyte characters
From: Danek Duvall @ 2015-05-13 18:29 UTC
To: Bart Schaefer; +Cc: zsh-users

On Wed, May 13, 2015 at 10:43:50AM -0700, Bart Schaefer wrote:

> On May 13,  9:14am, Danek Duvall wrote:
> } Subject: zsh doesn't understand some multibyte characters
> }
> } Perhaps this is just on Solaris, I dunno.  But for some multibyte
> } characters [...] if I move the cursor back over them or delete back
> } over them, zsh gets confused and moves two positions instead of one
> }
> } I'll note that the same thing happens with all the other shells on
> } Solaris [... ] Where else should I be looking for the problem?
>
> This sounds like the WCWIDTH() macro or function is returning the wrong
> value for some characters.

It does.

> If you are compiling your own zsh, can you (a) check whether config.h
> defines BROKEN_WCWIDTH, and (b) if it does not, try defining it and
> recompile to see if that makes any difference?

Not on its own; Solaris doesn't appear to define __STDC_ISO_10646__.  But
if I #define that to 1 (because nothing in zsh uses its value), then it
does work.

If I set

    comb_acute_mb[] = { (char)0xe2, (char)0x80, (char)0xa6 };

in the test, it thinks that character's wcwidth() is 2, not 1.  Perhaps
that should be a part of the test as well?  I don't know why the zero-width
combining character was chosen as the test.

I'm less sure what to do about __STDC_ISO_10646__.  I see that most of the
places it's checked you're also checking for __APPLE__, but not all of them
(and I'm not sure why that would be).

I can talk to our globalization folks who might know why this isn't
defined, or what it should be set to, or whatever, and file a bug if
necessary.

I guess until we figure that out, I can just have our zsh build define it
on the commandline (assuming that you don't want to hold 5.0.8 for this,
and I wouldn't want you to).

Thanks,
Danek

* Re: zsh doesn't understand some multibyte characters
From: Bart Schaefer @ 2015-05-13 20:20 UTC
To: zsh-users

On May 13, 11:29am, Danek Duvall wrote:
}
} I'm less sure what to do about __STDC_ISO_10646__.  I see that most of the
} places it's checked you're also checking for __APPLE__, but not all of them
} (and I'm not sure why that would be).
}
} I can talk to our globalization folks who might know why this isn't
} defined, or what it should be set to, or whatever, and file a bug if
} necessary.

You've hit the limit of my knowledge on this point, I'm afraid.  Someone
else will have to chime in about __STDC_ISO_10646__.

* Re: zsh doesn't understand some multibyte characters
From: Chet Ramey @ 2015-05-13 21:24 UTC
To: Bart Schaefer, zsh-users; +Cc: chet.ramey

On 5/13/15 4:20 PM, Bart Schaefer wrote:
> On May 13, 11:29am, Danek Duvall wrote:
> }
> } I'm less sure what to do about __STDC_ISO_10646__.  I see that most of the
> } places it's checked you're also checking for __APPLE__, but not all of them
> } (and I'm not sure why that would be).
> }
> } I can talk to our globalization folks who might know why this isn't
> } defined, or what it should be set to, or whatever, and file a bug if
> } necessary.
>
> You've hit the limit of my knowledge on this point, I'm afraid.  Someone
> else will have to chime in about __STDC_ISO_10646__.

That constant has to be defined by the implementation (compiler+libraries).
If it's defined -- and it's defined to a date -- it means that wchar_t
values are unicode values, as ISO/IEC 10646 defines them on that date,
regardless of the locale.

Practically, this means that you can convert a 32-bit character value to
a multibyte character just by calling, for example, wctomb, and not have
to use something like iconv to convert between locales.

No application should be defining that macro itself.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    chet@case.edu    http://cnswww.cns.cwru.edu/~chet/

* Re: zsh doesn't understand some multibyte characters
From: Jun T. @ 2015-05-14 16:43 UTC
To: zsh-users

2015/05/14 03:29, Danek Duvall <duvall@comfychair.org> wrote
>
> If I set
>
>     comb_acute_mb[] = { (char)0xe2, (char)0x80, (char)0xa6 };
>
> in the test, it thinks that character's wcwidth() is 2, not 1.

U+2026 is one of the characters whose "East Asian Width" property
is set to "Ambiguous". Widths of these characters are *really* ambiguous;
in western (monospaced) fonts they have a single width,
while in (most of?) CJK fonts they have double width.

Usually, wcwidth() returns 1 for these characters so they are not
displayed correctly in CJK fonts, unless applications take special care of
them. For example, xterm has an option -cjk to handle this problem.

Your report indicates that Solaris is one of the rare systems in
which wcwidth() returns 2 for U+2026.

Are there any fonts in which U+2026 has double width on Solaris?

> I don't know why the zero-width
> combining character was chosen as the test.

The test was first introduced to detect a broken wcwidth() on Mac OS X,
where wcwidth() returns 1 for combining characters.

* Re: zsh doesn't understand some multibyte characters
From: Danek Duvall @ 2015-05-14 17:32 UTC
To: Jun T.; +Cc: zsh-users

On Fri, May 15, 2015 at 01:43:45AM +0900, Jun T. wrote:

>
> 2015/05/14 03:29, Danek Duvall <duvall@comfychair.org> wrote
> >
> > If I set
> >
> >     comb_acute_mb[] = { (char)0xe2, (char)0x80, (char)0xa6 };
> >
> > in the test, it thinks that character's wcwidth() is 2, not 1.
>
> U+2026 is one of the characters whose "East Asian Width" property
> is set to "Ambiguous". Widths of these characters are *really* ambiguous;
> in western (monospaced) fonts they have a single width,
> while in (most of?) CJK fonts they have double width.
>
> Usually, wcwidth() returns 1 for these characters so they are not
> displayed correctly in CJK fonts, unless applications take special care of
> them. For example, xterm has an option -cjk to handle this problem.
>
> Your report indicates that Solaris is one of the rare systems in
> which wcwidth() returns 2 for U+2026.
>
> Are there any fonts in which U+2026 has double width on Solaris?

Likely, but I don't know for sure, and I'm not sure how to tell.

As one of our globalization folks explained in a long-open bug against
Solaris' "broken" wcwidth(), we currently have a single width table, and
the ambiguous-width characters all(?) come back as width 2.  They're
proposing two tables, switched based on the locale -- if you're in an east
Asian locale, you'll get 2 for these, and otherwise 1, similarly to the way
that gnome-terminal uses VTE_CJK_WIDTH.

The only commentary mk_wcwidth() has about ambiguous character widths is in
the alternate _cjk implementation, which he doesn't recommend for general
use.

I don't know if the Solaris approach (double-width in CJK locales,
single-width elsewhere) is common enough to want to make this
runtime-configurable in programs that care; for instance, zsh could have a
setopt flag to switch to double-width when the user knew they were in that
environment.

I'm a bit surprised that xterm's -cjk option isn't automatic -- shouldn't
it know whether the font it's loading is double-width or not?  Either way,
it could respond to some escape code that programs which care (or even
wcwidth() itself or a standard replacement) could use to query it about the
current width.  Perhaps that's the ideal solution?

I'd started talking to Thomas Dickey about this a couple of years ago (I
keep running into this problem, start talking to people about it, decide
it's too hard and I don't have enough time, and drop it until the next time
around); perhaps I could pick that thread up again with that suggestion?

FWIW, I tried xterm -cjk, both with my normal western font and with a CJK
font, and in both cases it handles U+2026 fine, putting it in a double-wide
box.  Vim seemed to handle it, too.

> > I don't know why the zero-width combining character was chosen as the
> > test.
>
> The test was first introduced to detect a broken wcwidth() on Mac OS X,
> where wcwidth() returns 1 for combining characters.

Which seems unambiguously broken, unlike the one on Solaris.

Thanks,
Danek