From: Rich Felker <dalias@aerifal.cx>
To: musl@lists.openwall.com
Subject: Re: Re: Status of Big5 and extensions
Date: Thu, 8 Aug 2013 00:30:35 -0400
Message-ID: <20130808043035.GK221@brightrain.aerifal.cx>
In-Reply-To: <20130808035321.GN25714@port70.net>

On Thu, Aug 08, 2013 at 05:53:21AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@aerifal.cx> [2013-08-07 22:11:19 -0400]:
> > Since you mentioned Big5-2003, I've been looking into it, and it seems
> > like it should be part of our base Big5 mapping. Diffing moztw's
> > version of it against CP950.TXT (after cleaning up both), I get:
>
> i checked another source for big5-2003 and it is bug-compatible
> with the moztw one (so it might not be mozilla's fault)
> http://www.csie.ntu.edu.tw/~r92030/project/big5/
>
> this source maps C255 to 5F5E instead of 5F5D
> (also observed in the icu version of cp950)
Unfortunately this contradicts the normative Unihan.txt, which says
U+5F5D corresponds to the historical Big5 character C255. If
Unihan.txt is the one that's buggy, we need at least some
justification before changing the mapping.
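
For checking claims like this one, a minimal sketch of a Big5-to-Unicode
lookup built from Unihan data. It assumes the tab-separated Unihan line
format (code point, field name, value) and the kBigFive field; the
function name is hypothetical, and the sample data contains only the
single entry stated above, not the real file.

```python
# One sample line in the assumed Unihan format. Only the entry quoted in
# this thread is included; a real run would read the Unihan data file.
SAMPLE = "U+5F5D\tkBigFive\tC255\n"

def big5_to_unicode(lines):
    """Build {big5_code: unicode_code_point} from kBigFive entries."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cp, field, value = line.split("\t", 2)
        if field != "kBigFive":
            continue  # ignore all other Unihan fields
        table[int(value, 16)] = int(cp[2:], 16)  # drop the "U+" prefix
    return table

table = big5_to_unicode(SAMPLE.splitlines())
print(hex(table[0xC255]))
```

Comparing such a table against a vendor mapping (CP950, moztw, ICU)
would show exactly which slots disagree with Unihan.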
> > These are all part of ETEN omitted from CP950, and should definitely
> > be in Big5 base.
> >
> > +0xC6A1 0x2460
> > +0xC6A2 0x2461
> > +0xC6A3 0x2462
> > +0xC6A4 0x2463
> > +...
> > +0xC7F1 0x30F5
> > +0xC7F2 0x30F6
> >
> > These are also from ETEN. Notably, the Cyrillic block that immediately
> > follows these is still omitted in Big5-2003, for reasons that appear
> > political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> > to add the Cyrillic block back in here.
> >
>
> the C6BF-C6D9 part is incompatible in hkscs and big5-2003
> hkscs == uao != big5-2003 for these codes
> icu agrees with the old hkscs pua codes so this might be
> just a bug in the big5-2003 source
I believe I've dug up the story on this here:
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035389.html
In short, the Big5-2003 mappings from moztw.org are wrong. The "KANGXI
RADICAL" characters in Unicode are compatibility characters. This
means they have compatibility equivalents which should be used in
their place in Unicode documents, much like the Greek letter μ
should be used in place of the Latin-1 MICRO SIGN (µ); they exist only
for round-trip compatibility with legacy character sets which encode
the character twice. Since Big5 (unlike CNS 11643) does not encode the
Kangxi radicals twice, using them in a mapping to Unicode is a wrong
use of Unicode, regardless of what the mapping table from the
standards body says. Thus, I have no problem with going with the
UAO/HKSCS way.
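
To illustrate the compatibility-equivalent point, a small sketch using
Python's standard unicodedata module. NFKC normalization replaces
compatibility characters with their equivalents; the helper name
fold_compat is made up for this example, and this is only a
demonstration of the Unicode property, not anything musl does.

```python
import unicodedata

def fold_compat(ch):
    """Return ch with compatibility characters replaced (NFKC form)."""
    return unicodedata.normalize("NFKC", ch)

# U+2F33 is a KANGXI RADICAL character, U+00B5 is MICRO SIGN.
for ch in ("\u2f33", "\u00b5"):
    folded = fold_compat(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          f"U+{ord(folded):04X} {unicodedata.name(folded)}")
```

Here U+2F33 folds to U+5E7A and MICRO SIGN folds to Greek mu, which is
why a mapping table that targets the Kangxi radical code points
produces text that does not survive normalization.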
According to the above link, however, HKSCS has introduced a problem.
They've double-mapped U+5E7A, and since the FBF4 slot already maps to
U+5E7A, they map the one in the C6CD slot to U+2F33 instead. I'm
not sure what the right solution to this is; since we're not
interested in round-trip, it might make the most sense to just ignore
it and map them both to the (same) proper character.
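
A rough sketch of what "map them both to the same character" costs,
using only the two slots discussed here. The table fragment and the
tie-break (the lowest Big5 code wins when encoding) are assumptions made
for illustration, not a description of musl's actual tables.

```python
# Decode-side fragment: both Big5 slots yield the same character.
DECODE = {0xC6CD: 0x5E7A, 0xFBF4: 0x5E7A}

def build_encode(decode):
    """Invert a decode table, keeping the lowest Big5 code per character."""
    encode = {}
    for big5 in sorted(decode):
        encode.setdefault(decode[big5], big5)  # first (lowest) code wins
    return encode

ENCODE = build_encode(DECODE)

def round_trips(big5):
    """True if decoding and re-encoding gives back the same Big5 code."""
    return ENCODE[DECODE[big5]] == big5

print(round_trips(0xC6CD), round_trips(0xFBF4))
```

Decoding stays correct for both slots; only one of them survives a
decode and re-encode cycle, which is acceptable when round-trip is not
a goal.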
> > -0xF9FA 0x256D
> > -0xF9FB 0x256E
> > -0xF9FC 0x2570
> > -0xF9FD 0x256F
> > +0xF9FA 0x2554
> > +0xF9FB 0x2557
> > +0xF9FC 0x255A
> > +0xF9FD 0x255D
> >
> > This looks like pure Mozilla cruft. Is there any justification for
> > these sorts of changes?
> >
>
> these are box drawing chars (like A2A4-A2A7 above),
> the diff is double vs light lines
>
> cp950 == hkscs == uao != big5-2003 (and missing from icu)
>
> hkscs maps F9FE to FFED instead of 2593 (cp950,uao,icu)
I don't really care so much about these anyway since they do not
affect linguistic content, just warez .nfo files. ;-)
> > Does the above analysis look correct? If so I will go ahead and merge
> > the above changes to Big5 support into musl.
> >
> > BTW, the only non-PUA part of UAO within the standard Big5 range
> > (89x157 grid) that won't be mapped with these changes is the stuff
> > right after the Cyrillic block. This part does not conflict with
> > current HKSCS, so if I had good sources from both the Taiwan and HK
> > sides supporting the position that these mappings will not conflict
> > with other extensions in current use or with future expansion of
> > HKSCS, we could consider including that part of UAO in the base Big5
> > mapping. At this point this is only an idea for consideration, but we
> > can keep it in mind.
>
> note that
> C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
> (old hkscs pua codes agree with uao)
OK, so is this non-conflicting?
Rich