mailing list of musl libc
From: Szabolcs Nagy <nsz@port70.net>
To: musl@lists.openwall.com
Subject: Re: Re: Status of Big5 and extensions
Date: Thu, 8 Aug 2013 05:53:21 +0200
Message-ID: <20130808035321.GN25714@port70.net>
In-Reply-To: <20130808021118.GI221@brightrain.aerifal.cx>

* Rich Felker <dalias@aerifal.cx> [2013-08-07 22:11:19 -0400]:
> Since you mentioned Big5-2003, I've been looking into it, and it seems
> like it should be part of our base Big5 mapping. Diffing moztw's
> version of it against CP950.TXT (after cleaning up both), I get:

i checked another source for big5-2003 and it is bug compatible
with the moztw one (so it might not be mozilla's fault)
http://www.csie.ntu.edu.tw/~r92030/project/big5/

this source maps C255 to 5F5E instead of 5F5D
(also observed in the icu version of cp950)
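
a comparison like the one quoted below can be reproduced with a
minimal python sketch along these lines. it assumes both tables have
already been cleaned up into "0xXXXX 0xYYYY" lines with optional
# comments; parse_table and diff_tables are hypothetical helper names,
not anything from musl or the sources above.

```python
def parse_table(text):
    """Parse lines like '0xA156 0x2013 # comment' into {big5: unicode}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line, or a code with no mapping listed
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def diff_tables(old, new):
    """Return '-'/'+' lines for every code whose mapping differs."""
    lines = []
    for code in sorted(old.keys() | new.keys()):
        a, b = old.get(code), new.get(code)
        if a == b:
            continue
        if a is not None:
            lines.append("-0x%04X 0x%04X" % (code, a))
        if b is not None:
            lines.append("+0x%04X 0x%04X" % (code, b))
    return lines


if __name__ == "__main__":
    # sample entries taken from the diff quoted in this thread
    cp950 = parse_table("0xA156\t0x2013\n0xA1C2\t0x00AF\n")
    big5_2003 = parse_table("0xA156 0x2015\n0xA1C2 0x203E\n0xA3C0 0x2400\n")
    print("\n".join(diff_tables(cp950, big5_2003)))
```

unlike diff(1) this pairs the - and + lines per code instead of
grouping them by hunk, and a dict keeps only one target per big5 code.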

> -0xA156 0x2013
> +0xA156 0x2015
> ...
> -0xA1C2 0x00AF
> +0xA1C2 0x203E
> ...
> -0xA2A4 0x2550
> -0xA2A5 0x255E
> -0xA2A6 0x256A
> -0xA2A7 0x2561
> +0xA2A4 0x2501
> +0xA2A5 0x251D
> +0xA2A6 0x253F
> +0xA2A7 0x2525
> 
> The above all looks like pure nonstandard Mozilla behavior. What's
> next is more interesting:
> 
> -0xA2CC 0x5341
> -0xA2CD 0x5344
> -0xA2CE 0x5345
> +0xA2CC 0x3038
> +0xA2CD 0x3039
> +0xA2CE 0x303A
> 
> This looks to me like an actual bug in CP950.TXT from Unicode.
> Unihan.txt says U+5341 is Big5's A451, so it can't also be A2CC. Same
> for the others. Indeed, CP950.TXT maps these in a non-one-to-one way,
> which is in itself almost certainly a bug.
> 
> +0xA3C0 0x2400
> +0xA3C1 0x2401
> +...
> +0xA3DF 0x241F
> +0xA3E0 0x2421
> 
> These are all part of ETEN omitted from CP950, and should definitely
> be in Big5 base.
> 
> +0xC6A1 0x2460
> +0xC6A2 0x2461
> +0xC6A3 0x2462
> +0xC6A4 0x2463
> +...
> +0xC7F1 0x30F5
> +0xC7F2 0x30F6
> 
> These are also from ETEN. Notably, the Cyrillic block that immediately
> follows these is still omitted in Big5-2003, for reasons that appear
> political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> to add the Cyrillic block back in here.
> 

the C6BF-C6D9 part is incompatible in hkscs and big5-2003
hkscs == uao != big5-2003 for these codes
icu agrees with the old hkscs pua codes so this might be
just a bug in the big5-2003 source
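
on the one-to-one point quoted further up: a check like the following
would flag such entries in any of these tables. this is only a sketch;
find_collisions is a hypothetical name and the table is assumed to be
a dict from big5 code to unicode code point. the sample data is the
A2CC/A451 pair mentioned above.

```python
from collections import defaultdict


def find_collisions(table):
    """Return {unicode: [big5 codes]} for targets hit by more than one code."""
    by_target = defaultdict(list)
    for code, ucs in table.items():
        by_target[ucs].append(code)
    return {ucs: sorted(codes)
            for ucs, codes in by_target.items() if len(codes) > 1}


if __name__ == "__main__":
    # A2CC -> 5341 is from the quoted diff, A451 -> 5341 from the Unihan remark
    table = {0xA2CC: 0x5341, 0xA451: 0x5341, 0xA156: 0x2013}
    for ucs, codes in find_collisions(table).items():
        print("U+%04X <- %s" % (ucs, ", ".join("%04X" % c for c in codes)))
```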

> Finally:
> 
> -0xF9FA 0x256D
> -0xF9FB 0x256E
> -0xF9FC 0x2570
> -0xF9FD 0x256F
> +0xF9FA 0x2554
> +0xF9FB 0x2557
> +0xF9FC 0x255A
> +0xF9FD 0x255D
> 
> This looks like pure Mozilla cruft. Is there any justification for
> these sorts of changes?
> 

these are box drawing chars (like A2A4-A2A7 above),
the diff is double vs light lines

cp950 == hkscs == uao != big5-2003 (and missing from icu)

hkscs maps F9FE to FFED instead of 2593 (cp950,uao,icu)
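
the double vs light difference for F9FA-F9FD can be seen from the
unicode character names. a small sketch using python's standard
unicodedata module; the two tables are just the code points from the
quoted diff, and the variable names are made up for illustration.

```python
import unicodedata

# F9FA..F9FD as mapped by cp950 (the '-' lines) and big5-2003 (the '+' lines)
CP950 = {0xF9FA: 0x256D, 0xF9FB: 0x256E, 0xF9FC: 0x2570, 0xF9FD: 0x256F}
BIG5_2003 = {0xF9FA: 0x2554, 0xF9FB: 0x2557, 0xF9FC: 0x255A, 0xF9FD: 0x255D}


def names(table):
    """Map each big5 code to the unicode name of its target."""
    return {code: unicodedata.name(chr(ucs)) for code, ucs in table.items()}


if __name__ == "__main__":
    old, new = names(CP950), names(BIG5_2003)
    for code in sorted(old):
        print("%04X: %s | %s" % (code, old[code], new[code]))
```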

> Does the above analysis look correct? If so I will go ahead and merge
> the above changes to Big5 support into musl.
> 
> BTW, the only non-PUA part of UAO within the standard Big5 range
> (89x157 grid) that won't be mapped with these changes is the stuff
> right after the Cyrillic block. This part does not conflict with
> current HKSCS, so if I had good sources from both the Taiwan and HK
> sides supporting the position that these mappings will not conflict
> with other extensions in current use or with future expansion of
> HKSCS, we could consider including that part of UAO in the base Big5
> mapping. At this point this is only an idea for consideration, but we
> can keep it in mind.

note that
C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
(old hkscs pua codes agree with uao)
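
for reference, my reading of the 89x157 grid mentioned above is lead
bytes A1-F9 (89 rows) and trail bytes 40-7E plus A1-FE (63 + 94 = 157
columns). a sketch of the index arithmetic under that assumption;
grid_index is a hypothetical helper, not the layout musl uses
internally.

```python
ROWS, COLS = 89, 157


def grid_index(code):
    """Linear index of a 2-byte big5 code in the 89x157 grid, or None."""
    lead, trail = code >> 8, code & 0xFF
    if not 0xA1 <= lead <= 0xF9:
        return None
    if 0x40 <= trail <= 0x7E:
        col = trail - 0x40          # first 63 columns
    elif 0xA1 <= trail <= 0xFE:
        col = trail - 0xA1 + 63     # remaining 94 columns
    else:
        return None
    return (lead - 0xA1) * COLS + col


if __name__ == "__main__":
    for code in (0xC87A, 0xC87C, 0xC8A4):
        print("%04X -> %s" % (code, grid_index(code)))
```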

