mailing list of musl libc
* [musl] Hangul Jamo vowels and trailing consonants should probably be 0 width
From: Luis Javier Merino @ 2021-12-27 22:38 UTC
  To: musl

Hello,

I've been looking at widths reported for Hangul Jamo in wcwidth implementations.

In glibc and MirBSD xterm, U+1160..U+11FF and U+D7B0..U+D7FF have 0 width.

In xterm/ncurses, GLib (g_unichar_iszerowidth), and Rust's
unicode-width, only U+1160..U+11FF have 0 width.

Konsole gave U+1160..U+11FF width 0 until October 2018, when it moved
from a wcwidth() based on Markus Kuhn's implementation to one generated
from the Unicode data files. Since then it returns width 1
(https://bugs.kde.org/show_bug.cgi?id=396435#c21).

libunistring, vim/NeoVim, ridiculousfish/widecharwidth seem to know
nothing about Hangul Jamo, and return width 1.

Some context follows:

Korean Hangul is a writing system which uses syllable blocks
consisting of alphabetic components. A syllable consists of one or
more leading consonants, one or more vowels, and zero or more trailing
consonants.

Unicode has precomposed syllable blocks at U+AC00..U+D7A3 (11172 code
points).
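
For reference, the 11172 figure is 19 leading consonants x 21 vowels x
28 trailing slots (27 trailing consonants plus "none"), which is how the
precomposed range is laid out. A minimal C sketch of that arithmetic
follows; the function and constant names are mine and purely
illustrative, only the numeric constants come from the Unicode Standard:

```c
#include <assert.h>
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul syllable composition;
 * the identifiers themselves are illustrative. */
enum { S_BASE = 0xAC00, L_COUNT = 19, V_COUNT = 21, T_COUNT = 28 };

/* Map (leading, vowel, trailing) indices to a precomposed syllable.
 * t == 0 means "no trailing consonant". */
static uint32_t hangul_compose(int l, int v, int t)
{
    assert(l >= 0 && l < L_COUNT);
    assert(v >= 0 && v < V_COUNT);
    assert(t >= 0 && t < T_COUNT);
    return (uint32_t)S_BASE + (uint32_t)((l * V_COUNT + v) * T_COUNT + t);
}
```

The first and last index triples, (0, 0, 0) and (18, 20, 27), land on
U+AC00 and U+D7A3 respectively.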

There are also component Jamos:

Hangul Jamo (U+1100..U+11FF).

U+1100..U+115F Choseong (initial, Leading Consonants) have
East_Asian_Width=Wide and Hangul_Syllable_Type=Leading_Jamo
U+1160..U+11A7 Jungseong (medial, Vowels) have
East_Asian_Width=Neutral and Hangul_Syllable_Type=Vowel_Jamo
U+11A8..U+11FF Jongseong (final, Trailing consonants) have
East_Asian_Width=Neutral and Hangul_Syllable_Type=Trailing_Jamo

U+A960..U+A97F Hangul Jamo Extended-A (choseong) have East_Asian_Width=Wide
U+D7B0..U+D7FF Hangul Jamo Extended-B (jungseong and jongseong) have
East_Asian_Width=Neutral
U+3130..U+318F Hangul Compatibility Jamo have no conjoining behavior
U+FFA0..U+FFDF half-width forms have no conjoining behavior.

U+1100..U+11FF, U+A960..U+A97F, and U+D7B0..U+D7FF have conjoining
behavior: a sequence of L+V+T* gets rendered as a single syllable block.
wcwidth() implementations tend to give U+1100..U+115F width 2 and
U+1160..U+11FF width 0, so that the widths of the components add up to
the correct total width of the resulting syllable block.

U+D7B0..U+D7FF should also have width 0.
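
To make the width arithmetic concrete, here is a small C sketch of the
assignment described above. It is not code from musl, glibc, or any
other implementation; jamo_width() and jamo_seq_width() are hypothetical
helpers, and the sketch assumes whole-block ranges as written in this
message (glibc's charmap only lists the assigned code points in
U+D7B0..U+D7FF):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: column width of a conjoining Hangul Jamo,
 * or -1 for any code point outside the ranges discussed here. */
static int jamo_width(uint32_t c)
{
    /* Choseong (leading consonants), East_Asian_Width=Wide */
    if ((c >= 0x1100 && c <= 0x115F) || (c >= 0xA960 && c <= 0xA97F))
        return 2;
    /* Jungseong and jongseong join the block started by the choseong */
    if ((c >= 0x1160 && c <= 0x11FF) || (c >= 0xD7B0 && c <= 0xD7FF))
        return 0;
    return -1;
}

/* Sum the widths of n code points; -1 if any is not a conjoining Jamo. */
static int jamo_seq_width(const uint32_t *s, size_t n)
{
    int total = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int w = jamo_width(s[i]);
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

With these widths, the sequence U+1112 U+1161 U+11AB (L, V, T) sums to
2 columns, the same as a precomposed syllable. If the V and T parts
were given width 1 instead, the same block would be counted as 4
columns.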

glibc gave width 0 to conjoining jungseong and jongseong at:

commit 7a79e321c6f85b204036c33d85f6b2aa794e7c76
Author: Thorsten Glaser <tg@mirbsd.de>
Date:   Fri Jul 14 14:02:50 2017 +0200

    Refresh generated charmap data and ChangeLog

            [BZ #21750]
            * charmaps/UTF-8: Refresh.

diff --git a/localedata/ChangeLog b/localedata/ChangeLog
index 04ef5ad071..9e05b4a652 100644
--- a/localedata/ChangeLog
+++ b/localedata/ChangeLog
@@ -1,3 +1,17 @@
+2017-07-14  Thorsten Glaser  <tg@mirbsd.de>
+
+       [BZ #21750]
+       * charmaps/UTF-8: Refresh.
+       * unicode-gen/utf8_gen.py (U+00AD): Set width to 1.
+       * unicode-gen/utf8_gen.py (U+1160..U+11FF): Set width to 0.
+       * unicode-gen/utf8_gen.py (U+3248..U+324F): Set width to 2.
+       * unicode-gen/utf8_gen.py (U+4DC0..U+4DFF): Likewise.
+       * unicode-gen/utf8_gen.py: Treat category Me and Mn as combining.
+       [BZ #19852]
+       * unicode-gen/utf8_gen.py: Process EastAsianWidth lines before
+       UnicodeData lines so the latter have precedence; remove hack
+       to group output by EastAsianWidth ranges.
+

[ ... snip ...]

commit 6e540caa21616d5ec5511fafb22819204525138e
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Tue Jun 16 08:29:40 2020 +0200

    Set width of JUNGSEONG/JONGSEONG characters from UD7B0 to UD7FB to 0 [BZ #26120]

    Reviewed-by: Carlos O'Donell <carlos@redhat.com>

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index 14c5d4fa33..8cce47cd97 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -48920,6 +48920,8 @@ WIDTH
 <UABE8>        0
 <UABED>        0
 <UAC00>...<UD7A3>      2
+<UD7B0>...<UD7C6>      0
+<UD7CB>...<UD7FB>      0
 <UF900>...<UFA6D>      2
 <UFA70>...<UFAD9>      2
 <UFB1E>        0

Regards,
--
Luis Javier Merino Morán


* Re: [musl] Hangul Jamo vowels and trailing consonants should probably be 0 width
From: Rich Felker @ 2021-12-27 23:43 UTC
  To: Luis Javier Merino; +Cc: musl

On Mon, Dec 27, 2021 at 11:38:06PM +0100, Luis Javier Merino wrote:
> Hello,
> 
> I've been looking at widths reported for Hangul Jamo in wcwidth implementations.
> 
> In glibc and MirBSD xterm, U+1160..U+11FF and U+D7B0..U+D7FF have 0 width.

Thanks for reporting! Indeed this is a bug and possibly even a
regression since I thought it was right. It looks like it happened in
commit 1b0ce9af6d2aa7b92edaf3e9c631cb635bae22bd, "new wcwidth
implementation (fast table-based)" thanks to the Unicode data not
having this right. Indeed:

-       R(0x1160, 0x11FF, 0),

I'll update the tools that generate the tables to account for the
omission.

Rich
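
A generator-side fix of the kind described in the reply might look
roughly like the following C sketch. This is an assumption for
illustration, not musl's actual table generator: width_for() and
zero_width_overrides are made-up names, the East_Asian_Width value is
assumed to be passed in as the short string from the Unicode data
files, and including U+D7B0..U+D7FF follows the original report
rather than anything confirmed in the reply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct range { uint32_t lo, hi; };

/* Hypothetical override list: ranges forced to width 0 because the
 * Unicode data files alone do not mark them as zero width. */
static const struct range zero_width_overrides[] = {
    { 0x1160, 0x11FF },  /* Hangul Jamo jungseong and jongseong */
    { 0xD7B0, 0xD7FF },  /* Hangul Jamo Extended-B (assumed here) */
};

/* Width for code point c given its East_Asian_Width value ("W", "N", ...).
 * Overrides are checked first so they take precedence over the data. */
static int width_for(uint32_t c, const char *eaw)
{
    size_t n = sizeof zero_width_overrides / sizeof zero_width_overrides[0];
    assert(eaw != NULL);
    for (size_t i = 0; i < n; i++)
        if (c >= zero_width_overrides[i].lo && c <= zero_width_overrides[i].hi)
            return 0;
    return (strcmp(eaw, "W") == 0 || strcmp(eaw, "F") == 0) ? 2 : 1;
}
```

A real generator would also have to handle combining marks and
non-printable characters; those are left out to keep the override step
visible.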
