Date: Thu, 14 May 2015 10:32:50 -0700
From: Danek Duvall
To: "Jun T."
Cc: zsh-users@zsh.org
Subject: Re: zsh doesn't understand some multibyte characters

On Fri, May 15, 2015 at 01:43:45AM +0900, Jun T. wrote:
>
> 2015/05/14 03:29, Danek Duvall wrote
> >
> > If I set
> >
> >     comb_acute_mb[] = { (char)0xe2, (char)0x80, (char)0xa6 };
> >
> > in the test, it thinks that character's wcwidth() is 2, not 1.
>
> U+2026 is one of the characters whose "East Asian Width" property
> is set to "Ambiguous". Widths of these characters are *really* ambiguous;
> in western (monospaced) fonts they have a single width,
> while in (most of?) CJK fonts they have double width.
>
> Usually, wcwidth() returns 1 for these characters so they are not
> displayed correctly in CJK fonts, unless applications take special care of
> them. For example, xterm has an option -cjk to handle this problem.
>
> Your report indicates that Solaris is one of the rare systems in
> which wcwidth() returns 2 for U+2026.
>
> Are there any fonts in which U+2026 has double width on Solaris?

Likely, but I don't know for sure, and I'm not sure how to tell.

As one of our globalization folks explained in a long-open bug against
Solaris' "broken" wcwidth(), we currently have a single width table, and
the ambiguous-width characters all(?) come back as width 2. They're
proposing two tables, switched based on the locale -- if you're in an east
Asian locale, you'll get 2 for these, and otherwise 1, similarly to the way
that gnome-terminal uses VTE_CJK_WIDTH.

The only commentary mk_wcwidth() has about ambiguous character widths is in
the alternate _cjk implementation, which its author doesn't recommend for
general use.

I don't know if the Solaris approach (double-width in CJK locales,
single-width elsewhere) is common enough to want to make this
runtime-configurable in programs that care; for instance, zsh could have a
setopt flag to switch to double-width when the user knows they are in that
environment.

I'm a bit surprised that xterm's -cjk option isn't automatic -- shouldn't
it know whether the font it's loading is double-width or not? Either way,
it could respond to some escape code that programs which care (or even
wcwidth() itself or a standard replacement) could use to query it about the
current width. Perhaps that's the ideal solution? I'd started talking to
Thomas Dickey about this a couple of years ago (I keep running into this
problem, start talking to people about it, decide it's too hard and I don't
have enough time, and drop it until the next time around); perhaps I could
pick that thread up again with that suggestion?

FWIW, I tried xterm -cjk, both with my normal western font and with a CJK
font, and in both cases it handles U+2026 fine, putting it in a double-wide
box. Vim seemed to handle it, too.

> > I don't know why the zero-width combining character was chosen as the
> > test.
>
> The test was first introduced to detect a broken wcwidth() on Mac OS X,
> where wcwidth() returns 1 for combining characters.

Which seems unambiguously broken, unlike the one on Solaris.

Thanks,
Danek
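
The locale-switched scheme discussed above (width 2 for ambiguous-width
characters such as U+2026 in East Asian locales, width 1 elsewhere) might
look roughly like the following minimal C sketch. The helper names
is_cjk_locale() and ambiguous_width(), and the list of language codes, are
assumptions made purely for illustration; they are not taken from Solaris,
zsh, or mk_wcwidth().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a locale name such as "ja_JP.UTF-8"
 * starts with a CJK language code. The list of codes is an assumption. */
int is_cjk_locale(const char *name)
{
    static const char *const codes[] = { "ja", "ko", "zh" };
    size_t i;

    if (name == NULL)
        return 0;
    for (i = 0; i < sizeof codes / sizeof codes[0]; i++) {
        /* Match the two-letter code only when it is followed by a
         * separator or the end of the string, so that a name like
         * "kok_IN" is not mistaken for "ko". */
        if (strncmp(name, codes[i], 2) == 0 &&
            (name[2] == '\0' || name[2] == '_' ||
             name[2] == '.' || name[2] == '@'))
            return 1;
    }
    return 0;
}

/* Hypothetical helper: the width to assign to an East Asian Ambiguous
 * character under the "double in CJK locales, single elsewhere" rule. */
int ambiguous_width(const char *locale_name)
{
    return is_cjk_locale(locale_name) ? 2 : 1;
}
```

A caller could pass the string returned by setlocale(LC_CTYPE, NULL) as
locale_name. Note that this only guesses from the locale; it cannot tell
what the terminal's font actually does, which is the gap that a terminal
query would fill.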