From: Roy
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Status of Big5 and extensions
Date: Thu, 08 Aug 2013 08:18:45 +0800
References: <20130807165044.GA14867@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On Thu, 08 Aug 2013 00:50:44 +0800, Rich Felker wrote:

> OK, so after a lot of research and discussion, I'm about to commit the
> first part of Big5 support in iconv. The plain "Big5" charset name is
> going to be the maximal set everybody agrees on, which, as far as I
> can tell, is just CP950.
> (Actually even IBM's Big5 variant in ICU
> differs from CP950 in a few places, but it's just wrong. The one
> ideograph where it differs conflicts with Unihan.txt, which is
> authoritative on which Unicode characters encode which character
> identities from historical CJK charsets.)
>
> As for extensions, my understanding of HKSCS is to a point now where I
> feel we can add it (charset name "Big5-HKSCS"), based on the 2008
> government publication which has a few new characters beyond the old
> versions in most software. (Thank you nsz for helping dig up all the
> files and researching how they differ!) However there are a few
> technical difficulties to implementation: the Unicode codepoints span
> a 17-bit range rather than just a 16-bit range, so we need an
> efficient way of doing the mappings. What's worse, several HKSCS
> codepoints map to Latin characters with multiple combining marks which
> have no precomposed representation. Supporting these will require
> extending iconv to be able to output two or more characters of output
> for each unit of input, which is mildly error-prone with the current
> design, so I may hold off on HKSCS support until I overhaul some of
> the core logic of the iconv function.
>
> Now, the hard part: Taiwan extensions. I appreciate all the help from
> Roy, but I'm still not to the point of having anything nearly ready to
> be added. The UAO extension set has at least 290 mappings to
> codepoints in the Unicode Private Use Areas (PUA), which makes it
> unsuitable for inclusion as-is. This may be a resolvable issue if
> these 290 characters all exist in Unicode and could be remapped, but I
> do not feel any of us are qualified to determine this.
> So UAO is
> probably still a long way away from being able to be adopted into
> iconv, unless we have authoritative data on the identities of these
> characters in the form of something from an official standards body or
> at the very least multiple major vendors (enough to ensure that any
> future standardization would be consistent and non-controversial).
>

I asked one of the creators of UAO at
http://forum.moztw.org/viewtopic.php?f=10&t=40174
and got a reply describing those PUA characters from the Big5 side:

0xFA40 - 0xFA63: Reserved for user-defined characters
0xC8A5 - 0xC8B0: For Big5-2003 compliance
Others: "Art characters" of the ChinaSea character set (csw 1.0, i.e.
cswsmin.tte), which are not mapped to Unicode and whose codepoints are
not occupied by HKSCS codepoints.

> If there are other supersets of CP950 (possibly subsets of UAO) which
> would be useful for supporting Taiwanese users, I would very much like
> to understand the situation on them and whether it's feasible to
> include them in iconv at this point. Perhaps the most important thing
> to know at this point is what practical difficulties exist for
> Taiwanese users limited to CP950, and which extension charsets could
> solve this. I feel like issues like "I cannot write my own name" are
> on a completely different level than "I can't mix Japanese text in
> with my legacy-encoded data [when I really should be using Unicode for
> this anyway]". The former sort of issue is something that demands some
> sort of support, even if imperfect or mildly hackish, whereas the
> latter is not a justification for hacks and bypassing proper standards
> processes.
>
> Rich
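
For reference, the three groups from the forum reply quoted above could be written down as a small C helper along these lines. This is only a sketch: the enum and function names are made up for illustration, the ranges are copied from the reply as-is, and the caller is assumed to have already established that UAO maps the given Big5 code into the PUA (the helper does not check that).

```c
#include <stdint.h>

/* Hypothetical labels for the three groups named in the forum reply. */
enum uao_pua_kind {
    UAO_PUA_USER_DEFINED, /* 0xFA40 - 0xFA63: reserved for user-defined characters */
    UAO_PUA_BIG5_2003,    /* 0xC8A5 - 0xC8B0: for Big5-2003 compliance */
    UAO_PUA_OTHER         /* everything else: ChinaSea "art characters" */
};

/*
 * Classify a two-byte Big5 code (lead byte in the high 8 bits) that
 * UAO is already known to map into the Private Use Area.
 * Assumption: the caller has done that check; any code outside the
 * two listed ranges simply falls into the "other" group.
 */
static enum uao_pua_kind uao_pua_classify(uint16_t big5)
{
    if (big5 >= 0xFA40 && big5 <= 0xFA63)
        return UAO_PUA_USER_DEFINED;
    if (big5 >= 0xC8A5 && big5 <= 0xC8B0)
        return UAO_PUA_BIG5_2003;
    return UAO_PUA_OTHER;
}
```

Only the first two groups have fixed ranges in the reply, so the third is whatever is left over; the actual list of those codes would have to come from the UAO tables themselves.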