mailing list of musl libc
From: Rich Felker <dalias@aerifal.cx>
To: musl@lists.openwall.com
Subject: Re: Re: Status of Big5 and extensions
Date: Thu, 8 Aug 2013 00:30:35 -0400
Message-ID: <20130808043035.GK221@brightrain.aerifal.cx>
In-Reply-To: <20130808035321.GN25714@port70.net>

On Thu, Aug 08, 2013 at 05:53:21AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@aerifal.cx> [2013-08-07 22:11:19 -0400]:
> > Since you mentioned Big5-2003, I've been looking into it, and it seems
> > like it should be part of our base Big5 mapping. Diffing moztw's
> > version of it against CP950.TXT (after cleaning up both), I get:
> 
> i checked another source for big5-2003 and it is bug compatible
> with the moztw one (so it might not be mozilla's fault)
> http://www.csie.ntu.edu.tw/~r92030/project/big5/
> 
> this source maps C255 to 5F5E instead of 5F5D
> (also observed in the icu version of cp950)

Unfortunately this conflicts with the normative Unihan.txt, which says
U+5F5D corresponds to the historical Big5 character C255. So if the
claim is that Unihan.txt is buggy, we need at least some justification
for the change.

> > These are all part of ETEN omitted from CP950, and should definitely
> > be in Big5 base.
> > 
> > +0xC6A1 0x2460
> > +0xC6A2 0x2461
> > +0xC6A3 0x2462
> > +0xC6A4 0x2463
> > +...
> > +0xC7F1 0x30F5
> > +0xC7F2 0x30F6
> > 
> > These are also from ETEN. Notably, the Cyrillic block that immediately
> > follows these is still omitted in Big5-2003, for reasons that appear
> > political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> > to add the Cyrillic block back in here.
> > 
> 
> the C6BF-C6D9 part is incompatible in hkscs and big5-2003
> hkscs == uao != big5-2003 for these codes
> icu agrees with the old hkscs pua codes so this might be
> just a bug in the big5-2003 source

I believe I've dug up the story on this here:

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035389.html

In short, the Big5-2003 mappings from moztw.org are wrong. The "KANGXI
RADICAL" characters in Unicode are compatibility characters. This
means they have compatibility equivalents which should be used in
their place in Unicode documents, much like the Greek letter μ
should be used in place of the Latin-1 MICRO SIGN (µ). They exist only
for round-trip compatibility with legacy character sets which encode
the character twice. Since Big5 (unlike CNS 11643) does not encode the
Kangxi radicals twice, using them in a mapping to Unicode is an
incorrect use of Unicode, regardless of what the mapping table from
the standards body says. Thus, I have no problem with going with the
UAO/HKSCS way.
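
As a rough illustration of the compatibility relationship described
above, here is a small Python sketch using the standard unicodedata
module. The helper name preferred_form is made up for this example,
and NFKC is used only as a convenient way to look up the compatibility
equivalent; it is not a proposal for how musl should build its tables.

```python
import unicodedata

def preferred_form(ch):
    """Return the NFKC form of ch, i.e. its compatibility equivalent."""
    return unicodedata.normalize("NFKC", ch)

# Compatibility characters mentioned in this message.
for compat in ("\u2f33", "\u00b5"):  # KANGXI RADICAL SHORT THREAD, MICRO SIGN
    target = preferred_form(compat)
    # decomposition() reports the "<compat> XXXX" mapping from UnicodeData.txt.
    print(f"U+{ord(compat):04X} {unicodedata.name(compat)}"
          f" [{unicodedata.decomposition(compat)}]"
          f" -> U+{ord(target):04X} {unicodedata.name(target)}")
```

The Kangxi radical and the MICRO SIGN both carry a <compat>
decomposition, which is what marks them as characters to avoid in a
one-way mapping to Unicode.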

According to the above link, however, HKSCS has introduced a problem.
It encodes the character U+5E7A twice: the FBF4 slot maps to U+5E7A,
so the one in the C6CD slot is mapped to U+2F33 instead. I'm not sure
what the right solution to this is; since we're not interested in
round-trip, it might make the most sense to just ignore the
distinction and map them both to the (same) proper character.
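
To make the "map them both" option concrete, here is a minimal Python
sketch of a decode-only table. The table DECODE and the two helper
functions are hypothetical names for illustration; only the two slots
and code points discussed above are included.

```python
# Hypothetical excerpt of a decode-only table: Big5 code -> Unicode code point.
DECODE = {
    0xC6CD: 0x5E7A,  # HKSCS uses U+2F33 for this slot; folded to U+5E7A here
    0xFBF4: 0x5E7A,  # HKSCS slot that already maps to U+5E7A
}

def decode_pair(lead, trail):
    """Decode one two-byte code using the table above."""
    return chr(DECODE[(lead << 8) | trail])

def encode_candidates(ch):
    """Return every code that decodes to ch; more than one means no round trip."""
    return sorted(code for code, cp in DECODE.items() if cp == ord(ch))

print(decode_pair(0xC6, 0xCD) == decode_pair(0xFB, 0xF4))
print([f"{code:04X}" for code in encode_candidates("\u5e7a")])
```

Decoding stays unambiguous; the cost is that the reverse direction has
two candidate codes, which only matters if round-trip is a goal.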

> > -0xF9FA 0x256D
> > -0xF9FB 0x256E
> > -0xF9FC 0x2570
> > -0xF9FD 0x256F
> > +0xF9FA 0x2554
> > +0xF9FB 0x2557
> > +0xF9FC 0x255A
> > +0xF9FD 0x255D
> > 
> > This looks like pure Mozilla cruft. Is there any justification for
> > these sorts of changes?
> > 
> 
> these are box drawing chars (like A2A4-A2A7 above),
> the diff is double vs light lines
> 
> cp950 == hkscs == uao != big5-2003 (and missing from icu)
> 
> hkscs maps F9FE to FFED instead of 2593 (cp950,uao,icu)

I don't really care so much about these anyway since they do not
affect linguistic content, just warez .nfo files. ;-)
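
For anyone who wants to see what the two sides of that diff actually
are, this Python sketch prints the Unicode names of the code points
for F9FA-F9FD. The values are copied from the diff quoted above; the
names MINUS_SIDE and PLUS_SIDE are just labels for the "-" and "+"
lines and make no claim about which source is which.

```python
import unicodedata

# Code points from the "-" and "+" lines of the quoted diff.
MINUS_SIDE = {0xF9FA: 0x256D, 0xF9FB: 0x256E, 0xF9FC: 0x2570, 0xF9FD: 0x256F}
PLUS_SIDE = {0xF9FA: 0x2554, 0xF9FB: 0x2557, 0xF9FC: 0x255A, 0xF9FD: 0x255D}

def names(table):
    """Map each Big5 code to the Unicode name of its target code point."""
    return {code: unicodedata.name(chr(cp)) for code, cp in table.items()}

minus_names = names(MINUS_SIDE)
plus_names = names(PLUS_SIDE)
for code in sorted(MINUS_SIDE):
    print(f"{code:04X}: {minus_names[code]} | {plus_names[code]}")
```

One side is the light arc (rounded corner) set and the other is the
double-line corner set, with matching directions for each code.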

> > Does the above analysis look correct? If so I will go ahead and merge
> > the above changes to Big5 support into musl.
> > 
> > BTW, the only non-PUA part of UAO within the standard Big5 range
> > (89x157 grid) that won't be mapped with these changes is the stuff
> > right after the Cyrillic block. This part does not conflict with
> > current HKSCS, so if I had good sources from both the Taiwan and HK
> > sides supporting the position that these mappings will not conflict
> > with other extensions in current use or with future expansion of
> > HKSCS, we could consider including that part of UAO in the base Big5
> > mapping. At this point this is only an idea for consideration, but we
> > can keep it in mind.
> 
> note that
> C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
> (old hkscs pua codes agree with uao)

OK, so is this non-conflicting?
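
For reference, here is how a code can be checked against the 89x157
grid mentioned in the quoted text. This Python sketch assumes the
usual Big5 layout, which is not spelled out in this thread: lead bytes
A1-F9 (89 rows) and trail bytes 40-7E plus A1-FE (63 + 94 = 157
columns). The function name grid_index is made up for the example.

```python
LEAD_MIN, LEAD_MAX = 0xA1, 0xF9  # assumed lead-byte range: 89 rows

def grid_index(lead, trail):
    """Return (row, column) in the 89x157 grid, or None if outside it."""
    if not LEAD_MIN <= lead <= LEAD_MAX:
        return None
    if 0x40 <= trail <= 0x7E:
        col = trail - 0x40        # columns 0..62
    elif 0xA1 <= trail <= 0xFE:
        col = trail - 0xA1 + 63   # columns 63..156
    else:
        return None
    return lead - LEAD_MIN, col

# The three codes mentioned in the quoted note.
for code in (0xC87A, 0xC87C, 0xC8A4):
    print(f"{code:04X} -> {grid_index(code >> 8, code & 0xFF)}")
```

Under that assumption all three codes fall inside the standard grid,
so the question is only whether their mappings conflict.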

Rich

