zsh-users
* zsh doesn't understand some multibyte characters
@ 2015-05-13 16:14 Danek Duvall
  2015-05-13 17:43 ` Bart Schaefer
  0 siblings, 1 reply; 7+ messages in thread
From: Danek Duvall @ 2015-05-13 16:14 UTC (permalink / raw)
  To: zsh-users

Perhaps this is just on Solaris, I dunno.  But for some multibyte
characters ("…", for instance), if I type them on the commandline -- either
using the Compose key on my keyboard, or via insert-unicode-character or
insert-composed-character -- then if I move the cursor back over them or
delete back over them, zsh gets confused and moves two positions instead of
one:

    $ PS1='$ '
    $ …_ (cursor at _; now hit backspace once)
    $_

I do have access to a Linux box running 4.3.17, and it doesn't seem to be a
problem there, but I don't know whether that's because of a different
version or a different OS.

I'll note that the same thing happens with all the other shells on Solaris,
so I'm guessing it's not directly a problem with zsh.  FWIW, bash (and
libreadline) is compiled with Solaris curses, while zsh is compiled with
ncurses, so I'm guessing it's not that.  But vim has no problems
whatsoever (old Solaris vi does).  Where else should I be looking for the
problem?

Thanks,
Danek



* Re: zsh doesn't understand some multibyte characters
  2015-05-13 16:14 zsh doesn't understand some multibyte characters Danek Duvall
@ 2015-05-13 17:43 ` Bart Schaefer
  2015-05-13 18:29   ` Danek Duvall
  0 siblings, 1 reply; 7+ messages in thread
From: Bart Schaefer @ 2015-05-13 17:43 UTC (permalink / raw)
  To: Danek Duvall, zsh-users

On May 13,  9:14am, Danek Duvall wrote:
} Subject: zsh doesn't understand some multibyte characters
}
} Perhaps this is just on Solaris, I dunno. But for some multibyte
} characters [...] if I move the cursor back over them or delete back
} over them, zsh gets confused and moves two positions instead of one
}
} I'll note that the same thing happens with all the other shells on
} Solaris [... ] Where else should I be looking for the problem?

This sounds like the WCWIDTH() macro or function is returning the wrong
value for some characters.

If you are compiling your own zsh, can you (a) check whether config.h
defines BROKEN_WCWIDTH, and (b) if it does not, try defining it and
recompile to see if that makes any difference?
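
For anyone who wants to see what the C library itself reports before
rebuilding zsh, here is a minimal sketch. probe_width() is a hypothetical
helper, not something from the zsh sources; it decodes one multibyte
character in the current locale and returns whatever wcwidth() says about it.

```c
#define _XOPEN_SOURCE 700   /* must precede all #includes; glibc hides wcwidth() otherwise */
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the multibyte character at mb (len bytes)
 * using the current LC_CTYPE locale and return wcwidth() for it.
 * Returns -2 if the bytes are invalid or incomplete in this locale;
 * -1 is wcwidth()'s own answer for non-printable characters. */
static int probe_width(const char *mb, size_t len)
{
    wchar_t wc;
    mbstate_t st;
    size_t n;

    memset(&st, 0, sizeof st);            /* start in the initial shift state */
    n = mbrtowc(&wc, mb, len, &st);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -2;
    return wcwidth(wc);
}
```

The caller has to select a UTF-8 locale first, for example with
setlocale(LC_ALL, ""), and can then pass the three bytes of the ellipsis,
probe_width("\xe2\x80\xa6", 3). The result depends on the platform's width
tables, so no expected value is given here.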



* Re: zsh doesn't understand some multibyte characters
  2015-05-13 17:43 ` Bart Schaefer
@ 2015-05-13 18:29   ` Danek Duvall
  2015-05-13 20:20     ` Bart Schaefer
  2015-05-14 16:43     ` Jun T.
  0 siblings, 2 replies; 7+ messages in thread
From: Danek Duvall @ 2015-05-13 18:29 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: zsh-users

On Wed, May 13, 2015 at 10:43:50AM -0700, Bart Schaefer wrote:

> On May 13,  9:14am, Danek Duvall wrote:
> } Subject: zsh doesn't understand some multibyte characters
> }
> } Perhaps this is just on Solaris, I dunno. But for some multibyte
> } characters [...] if I move the cursor back over them or delete back
> } over them, zsh gets confused and moves two positions instead of one
> }
> } I'll note that the same thing happens with all the other shells on
> } Solaris [... ] Where else should I be looking for the problem?
> 
> This sounds like the WCWIDTH() macro or function is returning the wrong
> value for some characters.

It does.

> If you are compiling your own zsh, can you (a) check whether config.h
> defines BROKEN_WCWIDTH, and (b) if it does not, try defining it and
> recompile to see if that makes any difference?

Not on its own; Solaris doesn't appear to define __STDC_ISO_10646__.  But
if I #define that to 1 (because nothing in zsh uses its value), then it
does work.

If I set

    comb_acute_mb[] = { (char)0xe2, (char)0x80, (char)0xa6 };

in the test, it thinks that character's wcwidth() is 2, not 1.  Perhaps
that should be a part of the test as well?  I don't know why the zero-width
combining character was chosen as the test.
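
As a cross-check on those three bytes, the sketch below decodes a UTF-8
sequence by hand. utf8_decode() is a hypothetical name and deliberately
minimal (it does not reject overlong forms or surrogates); it only shows
that e2 80 a6 is U+2026, and that cc 81 is U+0301 COMBINING ACUTE ACCENT,
which is presumably what the variable name comb_acute_mb refers to.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence of 1 to 4 bytes into a code point.
 * Returns UINT32_MAX on a bad lead byte, a bad continuation byte,
 * or truncated input. Overlong encodings are not detected. */
static uint32_t utf8_decode(const unsigned char *s, size_t len)
{
    uint32_t cp;
    size_t need, i;

    if (len == 0)
        return UINT32_MAX;

    /* The lead byte gives the payload bits and the number of continuation bytes. */
    if (s[0] < 0x80)                { cp = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 3; }
    else
        return UINT32_MAX;

    if (len < need + 1)
        return UINT32_MAX;

    /* Each continuation byte is 10xxxxxx and contributes six bits. */
    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return UINT32_MAX;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return cp;
}
```

Working it through for e2 80 a6: the lead byte leaves 0x2, the first
continuation byte adds 0x00 and the second adds 0x26, giving 0x2026.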

I'm less sure what to do about __STDC_ISO_10646__.  I see that most of the
places it's checked you're also checking for __APPLE__, but not all of them
(and I'm not sure why that would be).

I can talk to our globalization folks who might know why this isn't
defined, or what it should be set to, or whatever, and file a bug if
necessary.  I guess until we figure that out, I can just have our zsh build
define it on the commandline (assuming that you don't want to hold 5.0.8
for this, and I wouldn't want you to).

Thanks,
Danek



* Re: zsh doesn't understand some multibyte characters
  2015-05-13 18:29   ` Danek Duvall
@ 2015-05-13 20:20     ` Bart Schaefer
  2015-05-13 21:24       ` Chet Ramey
  2015-05-14 16:43     ` Jun T.
  1 sibling, 1 reply; 7+ messages in thread
From: Bart Schaefer @ 2015-05-13 20:20 UTC (permalink / raw)
  To: zsh-users

On May 13, 11:29am, Danek Duvall wrote:
}
} I'm less sure what to do about __STDC_ISO_10646__.  I see that most of the
} places it's checked you're also checking for __APPLE__, but not all of them
} (and I'm not sure why that would be).
} 
} I can talk to our globalization folks who might know why this isn't
} defined, or what it should be set to, or whatever, and file a bug if
} necessary.

You've hit the limit of my knowledge on this point, I'm afraid.  Someone
else will have to chime in about __STDC_ISO_10646__.



* Re: zsh doesn't understand some multibyte characters
  2015-05-13 20:20     ` Bart Schaefer
@ 2015-05-13 21:24       ` Chet Ramey
  0 siblings, 0 replies; 7+ messages in thread
From: Chet Ramey @ 2015-05-13 21:24 UTC (permalink / raw)
  To: Bart Schaefer, zsh-users; +Cc: chet.ramey

On 5/13/15 4:20 PM, Bart Schaefer wrote:
> On May 13, 11:29am, Danek Duvall wrote:
> }
> } I'm less sure what to do about __STDC_ISO_10646__.  I see that most of the
> } places it's checked you're also checking for __APPLE__, but not all of them
> } (and I'm not sure why that would be).
> } 
> } I can talk to our globalization folks who might know why this isn't
> } defined, or what it should be set to, or whatever, and file a bug if
> } necessary.
> 
> You've hit the limit of my knowledge on this point, I'm afraid.  Someone
> else will have to chime in about __STDC_ISO_10646__.

That constant has to be defined by the implementation (compiler+libraries).
If it's defined -- and it's defined to a date -- it means that wchar_t
values are unicode values, as ISO/IEC 10646 defines them on that date,
regardless of the locale.  Practically, this means that you can convert a
32-bit character value to a multibyte character just by calling, for
example, wctomb, and not have to use something like iconv to convert
between locales.  No application should be defining that macro itself.
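
A minimal sketch of how a program might consult the macro rather than
define it. Both function names are hypothetical. encode_codepoint() treats
a Unicode value as a wchar_t only when the implementation has made that
promise, and refuses otherwise.

```c
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/* Returns the implementation's __STDC_ISO_10646__ value (a date of the
 * form yyyymmL), or 0 if the implementation does not define the macro. */
static long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return (long)__STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Encode a Unicode code point into the current locale's multibyte
 * encoding with wctomb(). buf must have room for MB_LEN_MAX bytes.
 * Returns the number of bytes written, or -1 if the macro is not
 * defined or the character cannot be represented in this locale. */
static int encode_codepoint(unsigned long cp, char *buf)
{
    if (iso10646_version() == 0)
        return -1;                 /* wchar_t values are not known to be Unicode */
    (void)wctomb(NULL, L'\0');     /* reset any shift state */
    return wctomb(buf, (wchar_t)cp);
}
```

On a system that does not define the macro, a caller would have to fall
back to something like iconv, as described above.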

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    chet@case.edu    http://cnswww.cns.cwru.edu/~chet/



* Re: zsh doesn't understand some multibyte characters
  2015-05-13 18:29   ` Danek Duvall
  2015-05-13 20:20     ` Bart Schaefer
@ 2015-05-14 16:43     ` Jun T.
  2015-05-14 17:32       ` Danek Duvall
  1 sibling, 1 reply; 7+ messages in thread
From: Jun T. @ 2015-05-14 16:43 UTC (permalink / raw)
  To: zsh-users


2015/05/14 03:29, Danek Duvall <duvall@comfychair.org> wrote
> 
> If I set
> 
>    comb_acute_mb[] = { (char)0xe2, (char)0x80, (char)0xa6 };
> 
> in the test, it thinks that character's wcwidth() is 2, not 1.

U+2026 is one of the characters whose "East Asian Width" property
is set to "Ambiguous". Widths of these characters are *really* ambiguous;
in western (monospaced) fonts they have a single width,
while in (most of?) CJK fonts they have double width.

Usually, wcwidth() returns 1 for these characters so they are not
displayed correctly in CJK fonts, unless applications take special care of
them. For example, xterm has an option -cjk to handle this problem.

Your report indicates that Solaris is one of the rare systems in
which wcwidth() returns 2 for U+2026.

Are there any fonts in which U+2026 has double width on Solaris?

> I don't know why the zero-width
> combining character was chosen as the test.

The test was first introduced to detect a broken wcwidth() on Mac OS X,
where wcwidth() returns 1 for combining characters.



* Re: zsh doesn't understand some multibyte characters
  2015-05-14 16:43     ` Jun T.
@ 2015-05-14 17:32       ` Danek Duvall
  0 siblings, 0 replies; 7+ messages in thread
From: Danek Duvall @ 2015-05-14 17:32 UTC (permalink / raw)
  To: Jun T.; +Cc: zsh-users

On Fri, May 15, 2015 at 01:43:45AM +0900, Jun T. wrote:

> 
> 2015/05/14 03:29, Danek Duvall <duvall@comfychair.org> wrote
> > 
> > If I set
> > 
> >    comb_acute_mb[] = { (char)0xe2, (char)0x80, (char)0xa6 };
> > 
> > in the test, it thinks that character's wcwidth() is 2, not 1.
> 
> U+2026 is one of the characters whose "East Asian Width" property
> is set to "Ambiguous". Widths of these characters are *really* ambiguous;
> in western (monospaced) fonts they have a single width,
> while in (most of?) CJK fonts they have double width.
> 
> Usually, wcwidth() returns 1 for these characters so they are not
> displayed correctly in CJK fonts, unless applications take special care of
> them. For example, xterm has an option -cjk to handle this problem.
> 
> Your report indicates that Solaris is one of the rare systems in
> which wcwidth() returns 2 for U+2026.
> 
> Are there any fonts in which U+2026 has double width on Solaris?

Likely, but I don't know for sure, and I'm not sure how to tell.

As one of our globalization folks explained in a long-open bug against
Solaris' "broken" wcwidth(), we currently have a single width table, and
the ambiguous-width characters all(?) come back as width 2.  They're
proposing two tables, switched based on the locale -- if you're in an east
Asian locale, you'll get 2 for these, and otherwise 1, similarly to the way
that gnome-terminal uses VTE_CJK_WIDTH.
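
To make the two-table idea concrete, here is a rough sketch. It is not the
Solaris implementation or proposal; every name in it is hypothetical, the
list of ambiguous characters is a three-entry sample rather than a real
table, and guessing "CJK" from the locale name prefix is an assumption made
only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tiny sample of East Asian Width "Ambiguous" code points.
 * A real table has many more entries; this is illustrative only. */
static const uint32_t ambiguous_sample[] = { 0x00A1, 0x00B1, 0x2026 };

static int is_ambiguous(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof ambiguous_sample / sizeof ambiguous_sample[0]; i++)
        if (ambiguous_sample[i] == cp)
            return 1;
    return 0;
}

/* Assumed heuristic: treat locale names starting with ja, ko or zh as CJK. */
static int locale_is_cjk(const char *name)
{
    return name != NULL &&
           (strncmp(name, "ja", 2) == 0 ||
            strncmp(name, "ko", 2) == 0 ||
            strncmp(name, "zh", 2) == 0);
}

/* Column width under the locale-switched scheme: ambiguous characters
 * are 2 columns in a CJK locale and 1 column elsewhere. */
static int sketch_width(uint32_t cp, int cjk_locale)
{
    if (cp == 0x0301)                    /* combining acute accent */
        return 0;
    if (is_ambiguous(cp))
        return cjk_locale ? 2 : 1;
    if (cp >= 0x4E00 && cp <= 0x9FFF)    /* CJK Unified Ideographs: always wide */
        return 2;
    return 1;                            /* everything else: treated as narrow here */
}
```

With this, sketch_width(0x2026, locale_is_cjk("ja_JP.UTF-8")) takes the
double-width branch, while the same call with "en_US.UTF-8" takes the
single-width one.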

The only commentary mk_wcwidth() has about ambiguous character widths is in
the alternate _cjk implementation, which its author doesn't recommend for general
use.  I don't know if the Solaris approach (double-width in CJK locales,
single-width elsewhere) is common enough to want to make this
runtime-configurable in programs that care; for instance, zsh could have a
setopt flag to switch to double-width when the user knew they were in that
environment.

I'm a bit surprised that xterm's -cjk option isn't automatic -- shouldn't
it know whether the font it's loading is double-width or not?  Either way,
it could respond to some escape code that programs which care (or even
wcwidth() itself or a standard replacement) could use to query it about the
current width.  Perhaps that's the ideal solution?

I'd started talking to Thomas Dickey about this a couple of years ago (I
keep running into this problem, start talking to people about it, decide
it's too hard and I don't have enough time, and drop it until the next time
around); perhaps I could pick that thread up again with that suggestion?

FWIW, I tried xterm -cjk, both with my normal western font and with a CJK
font, and in both cases it handles U+2026 fine, putting it in a double-wide
box.  Vim seemed to handle it, too.

> > I don't know why the zero-width combining character was chosen as the
> > test.
> 
> The test was first introduced to detect a broken wcwidth() on Mac OS X,
> where wcwidth() returns 1 for combining characters.

Which seems unambiguously broken, unlike the one on Solaris.

Thanks,
Danek


