From: Roy
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Re: Status of Big5 and extensions
Date: Thu, 08 Aug 2013 11:48:26 +0800
References: <20130807165044.GA14867@brightrain.aerifal.cx> <20130808021118.GI221@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On Thu, 08 Aug 2013 10:11:19 +0800, Rich Felker wrote:

> On Thu, Aug 08, 2013 at 08:18:45AM +0800, Roy wrote:
>> >Now, the hard part: Taiwan extensions. I appreciate all the help from
>> >Roy, but I'm still not to the point of having anything nearly ready to
>> >be added.
>> >The UAO extension set has at least 290 mappings to
>> >codepoints in the Unicode Private Use Areas (PUA), which makes it
>> >unsuitable for inclusion as-is. This may be a resolvable issue if
>> >these 290 characters all exist in Unicode and could be remapped, but I
>> >do not feel any of us are qualified to determine this. So UAO is
>> >probably still a long way away from being able to be adopted into
>> >iconv, unless we have authoritative data on the identities of these
>> >characters in the form of something from an official standards body or
>> >at the very least multiple major vendors (enough to ensure that any
>> >future standardization would be consistent and non-controversial).
>> >
>>
>> I asked one of the creators of UAO at
>> http://forum.moztw.org/viewtopic.php?f=10&t=40174
>> and the reply describes those PUA characters by their Big5 ranges:
>> 0xFA40 - 0xFA63: Reserved for user-defined characters
>
> So these can simply be dropped from the mapping.
>
>> 0xC8A5 - 0xC8B0: for Big5-2003 compliance
>
> I do not see any C8xx mappings in Big5-2003, so this explanation does
> not seem plausible.

Neither do I, but this range exists in the draft, marked as reserved:
http://web.archive.org/web/20041210015709/http://pingyeh.net/big5/big5-2003-v3/summary.txt

>
> Since you mentioned Big5-2003, I've been looking into it, and it seems
> like it should be part of our base Big5 mapping. Diffing moztw's
> version of it against CP950.TXT (after cleaning up both), I get:
>
> -0xA156 0x2013
> +0xA156 0x2015
> ...
> -0xA1C2 0x00AF
> +0xA1C2 0x203E
> ...
> -0xA2A4 0x2550
> -0xA2A5 0x255E
> -0xA2A6 0x256A
> -0xA2A7 0x2561
> +0xA2A4 0x2501
> +0xA2A5 0x251D
> +0xA2A6 0x253F
> +0xA2A7 0x2525
>
> The above all looks like pure nonstandard Mozilla behavior. What's
> next is more interesting:
>
> -0xA2CC 0x5341
> -0xA2CD 0x5344
> -0xA2CE 0x5345
> +0xA2CC 0x3038
> +0xA2CD 0x3039
> +0xA2CE 0x303A
>
> This looks to me like an actual bug in CP950.TXT from Unicode.
> Unihan.txt says U+5341 is Big5's A451, so it can't also be A2CC. Same
> for the others. Indeed, CP950.TXT maps these in a non-one-to-one way,
> which is in itself almost certainly a bug.

The move of 0xA2CC-0xA2CE from Unihan to symbols was approved at the
Big5-2003 meeting (minutes, vote #1):
http://web.archive.org/web/20050307160806/http://pingyeh.net/big5/big5-2003-v3/1106-2003-note.txt

>
> +0xA3C0 0x2400
> +0xA3C1 0x2401
> +...
> +0xA3DF 0x241F
> +0xA3E0 0x2421
>
> These are all part of ETEN omitted from CP950, and should definitely
> be in Big5 base.

Those are control symbols, listed in summary.txt.

>
> +0xC6A1 0x2460
> +0xC6A2 0x2461
> +0xC6A3 0x2462
> +0xC6A4 0x2463
> +...
> +0xC7F1 0x30F5
> +0xC7F2 0x30F6
>
> These are also from ETEN. Notably, the Cyrillic block that immediately
> follows these is still omitted in Big5-2003, for reasons that appear
> political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> to add the Cyrillic block back in here.

The government said the Cyrillic block is not in CNS 11643, so it was
removed. But YES, we should have it.

>
> Finally:
>
> -0xF9FA 0x256D
> -0xF9FB 0x256E
> -0xF9FC 0x2570
> -0xF9FD 0x256F
> +0xF9FA 0x2554
> +0xF9FB 0x2557
> +0xF9FC 0x255A
> +0xF9FD 0x255D
>
> This looks like pure Mozilla cruft. Is there any justification for
> these sorts of changes?

That was another vote at the meeting (minutes, vote #2).
Big5-2003 draft v3 (should be identical to the released version) for
reference:
http://web.archive.org/web/20051017043312/http://pingyeh.net/big5/big5-2003-v3/big5-2003-draft-v3.csv

BTW, MS voted in the meeting but did not adopt Big5-2003 in its newer
OSes. WTF?

>
> Does the above analysis look correct? If so I will go ahead and merge
> the above changes to Big5 support into musl.

Mostly correct IMHO.

>
> BTW, the only non-PUA part of UAO within the standard Big5 range
> (89x157 grid) that won't be mapped with these changes is the stuff
> right after the Cyrillic block.
> This part does not conflict with
> current HKSCS, so if I had good sources from both the Taiwan and HK
> sides supporting the position that these mappings will not conflict
> with other extensions in current use or with future expansion of
> HKSCS, we could consider including that part of UAO in the base Big5
> mapping. At this point this is only an idea for consideration, but we
> can keep it in mind.
>
>> Others: "Art characters" of the ChinaSea character set (csw 1.0, i.e.
>> cswsmin.tte), which are not mapped to Unicode and sit at code points
>> not occupied by HKSCS.
>
> So basically dingbats?

No only dingbats AFAIK.

>
> Rich
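
For anyone who wants to reproduce the comparison discussed above, here is a minimal sketch in Python of the two checks: diffing two cleaned-up mapping tables, and finding Unicode values reached from more than one Big5 code. It is not the script anyone in this thread used. The function names are hypothetical, the input format ("0xBIG5 0xUNICODE" per line, "#" comments) is an assumption, and the sample rows are only the handful quoted in this message plus an assumed 0xA451 row taken from the Unihan remark.

```python
from collections import defaultdict


def parse_table(text):
    """Parse lines like '0xA156 0x2013' into {big5_code: unicode_codepoint}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding space
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line, or a code with no Unicode value
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def diff_tables(old, new):
    """Return (removed, added): pairs only in old, pairs only in new, sorted."""
    removed = sorted(set(old.items()) - set(new.items()))
    added = sorted(set(new.items()) - set(old.items()))
    return removed, added


def many_to_one(table):
    """Return {unicode: [big5 codes]} for Unicode values with several sources."""
    reverse = defaultdict(list)
    for code, uni in sorted(table.items()):
        reverse[uni].append(code)
    return {uni: codes for uni, codes in reverse.items() if len(codes) > 1}


# Sample rows only. The 0xA451 row is an assumption based on the statement
# that Unihan.txt lists U+5341 as Big5 A451.
CP950_SAMPLE = """
0xA2CC 0x5341
0xA2CD 0x5344
0xA2CE 0x5345
0xA451 0x5341
"""

BIG5_2003_SAMPLE = """
0xA2CC 0x3038
0xA2CD 0x3039
0xA2CE 0x303A
0xA451 0x5341
"""

cp950 = parse_table(CP950_SAMPLE)
big5_2003 = parse_table(BIG5_2003_SAMPLE)

removed, added = diff_tables(cp950, big5_2003)
for code, uni in removed:
    print(f"-0x{code:04X} 0x{uni:04X}")
for code, uni in added:
    print(f"+0x{code:04X} 0x{uni:04X}")

for uni, codes in many_to_one(cp950).items():
    sources = ", ".join(f"0x{c:04X}" for c in codes)
    print(f"U+{uni:04X} is reached from: {sources}")
```

On the sample rows this reports U+5341 as reached from both 0xA2CC and 0xA451 in the CP950 sample, and no duplicates in the Big5-2003 sample. Run against the full tables, the same reverse-mapping check would list every non-one-to-one entry, which is the property called out above as a likely bug.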