From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Re: Status of Big5 and extensions
Date: Wed, 7 Aug 2013 22:11:19 -0400
Message-ID: <20130808021118.GI221@brightrain.aerifal.cx>
References: <20130807165044.GA14867@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On Thu, Aug 08, 2013 at 08:18:45AM +0800, Roy wrote:
> >Now, the hard part: Taiwan extensions. I appreciate all the help from
> >Roy, but I'm still not to the point of having anything nearly ready to
> >be added. The UAO extension set has at least 290 mappings to
> >codepoints in the Unicode Private Use Areas (PUA), which makes it
> >unsuitable for inclusion as-is.
> >This may be a resolvable issue if
> >these 290 characters all exist in Unicode and could be remapped, but I
> >do not feel any of us are qualified to determine this. So UAO is
> >probably still a long way away from being able to be adopted into
> >iconv, unless we have authoritative data on the identities of these
> >characters in the form of something from an official standards body or
> >at the very least multiple major vendors (enough to ensure that any
> >future standardization would be consistent and non-controversial).
> >
>
> I did ask one of creator of UAO in
> http://forum.moztw.org/viewtopic.php?f=10&t=40174
> And there is a reply of those PUA characters in Big5 view:
> 0xFA40 - 0xFA63: Reserved for user-defined characters

So these can simply be dropped from the mapping.

> 0xC8A5 - 0xC8B0: for Big5-2003 compliant

I do not see any C8xx mappings in Big5-2003, so this explanation does
not seem plausible.

Since you mentioned Big5-2003, I've been looking into it, and it seems
like it should be part of our base Big5 mapping. Diffing moztw's
version of it against CP950.TXT (after cleaning up both), I get:

-0xA156 0x2013
+0xA156 0x2015
...
-0xA1C2 0x00AF
+0xA1C2 0x203E
...
-0xA2A4 0x2550
-0xA2A5 0x255E
-0xA2A6 0x256A
-0xA2A7 0x2561
+0xA2A4 0x2501
+0xA2A5 0x251D
+0xA2A6 0x253F
+0xA2A7 0x2525

The above all looks like pure nonstandard Mozilla behavior. What's
next is more interesting:

-0xA2CC 0x5341
-0xA2CD 0x5344
-0xA2CE 0x5345
+0xA2CC 0x3038
+0xA2CD 0x3039
+0xA2CE 0x303A

This looks to me like an actual bug in CP950.TXT from Unicode.
Unihan.txt says U+5341 is Big5's A451, so it can't also be A2CC. Same
for the others. Indeed, CP950.TXT maps these in a non-one-to-one way,
which is in itself almost certainly a bug.

+0xA3C0 0x2400
+0xA3C1 0x2401
+...
+0xA3DF 0x241F
+0xA3E0 0x2421

These are all part of ETEN omitted from CP950, and should definitely
be in Big5 base.

+0xC6A1 0x2460
+0xC6A2 0x2461
+0xC6A3 0x2462
+0xC6A4 0x2463
+...
+0xC7F1 0x30F5
+0xC7F2 0x30F6

These are also from ETEN. Notably, the Cyrillic block that immediately
follows these is still omitted in Big5-2003, for reasons that appear
political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
to add the Cyrillic block back in here.

Finally:

-0xF9FA 0x256D
-0xF9FB 0x256E
-0xF9FC 0x2570
-0xF9FD 0x256F
+0xF9FA 0x2554
+0xF9FB 0x2557
+0xF9FC 0x255A
+0xF9FD 0x255D

This looks like pure Mozilla cruft.

Is there any justification for these sorts of changes? Does the above
analysis look correct? If so I will go ahead and merge the above
changes to Big5 support into musl.

BTW, the only non-PUA part of UAO within the standard Big5 range
(89x157 grid) that won't be mapped with these changes is the stuff
right after the Cyrillic block. This part does not conflict with
current HKSCS, so if I had good sources from both the Taiwan and HK
sides supporting the position that these mappings will not conflict
with other extensions in current use or with future expansion of
HKSCS, we could consider including that part of UAO in the base Big5
mapping. At this point this is only an idea for consideration, but we
can keep it in mind.

> Others: "Art characters" of ChinaSea character set(csw 1.0, i.e.
> cswsmin.tte), which are not mapped to Unicode and the codepoint that
> has not occupied by HKSCS codepoints.

So basically dingbats?

Rich
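
For anyone wanting to repeat the comparison above, here is a rough
sketch of the two checks involved: diffing two cleaned-up mapping
tables, and looking for Unicode values reached from more than one Big5
code. It is not the script used for this message. The function names
are made up for illustration, the input format is assumed to be
"0xBIG5 0xUNICODE" pairs with optional '#' comments, and the sample
data is only the A2CC..A2CE excerpt quoted above plus the A451 entry,
which is assumed to be present in both tables.

```python
from collections import defaultdict


def parse_table(text):
    """Parse '0xA2CC 0x5341' style lines; skip blanks, comments, unmapped codes."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # code listed without a Unicode column
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def diff_tables(old, new):
    """Return (removed, added) lists of (big5, unicode) pairs that differ."""
    removed = sorted((k, v) for k, v in old.items() if new.get(k) != v)
    added = sorted((k, v) for k, v in new.items() if old.get(k) != v)
    return removed, added


def duplicate_targets(table):
    """Return {unicode: [big5, ...]} for Unicode values with several sources."""
    by_target = defaultdict(list)
    for big5, uni in table.items():
        by_target[uni].append(big5)
    return {u: sorted(c) for u, c in by_target.items() if len(c) > 1}


# Illustrative excerpts only, not the full tables.
cp950_sample = """
0xA2CC 0x5341
0xA2CD 0x5344
0xA2CE 0x5345
0xA451 0x5341   # assumed present, per the Unihan.txt remark
"""
big5_2003_sample = """
0xA2CC 0x3038
0xA2CD 0x3039
0xA2CE 0x303A
0xA451 0x5341   # assumed unchanged
"""

old = parse_table(cp950_sample)
new = parse_table(big5_2003_sample)

removed, added = diff_tables(old, new)
for big5, uni in removed:
    print(f"-0x{big5:04X} 0x{uni:04X}")
for big5, uni in added:
    print(f"+0x{big5:04X} 0x{uni:04X}")

for uni, codes in sorted(duplicate_targets(old).items()):
    sources = ", ".join(f"0x{c:04X}" for c in codes)
    print(f"U+{uni:04X} is mapped from: {sources}")
```

On real input the duplicate check should be run on each table
separately, since a non-one-to-one mapping is a property of a single
table and not of the diff.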