From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Status of Big5 and extensions
Date: Wed, 7 Aug 2013 12:50:44 -0400
Message-ID: <20130807165044.GA14867@brightrain.aerifal.cx>
To: musl@lists.openwall.com
Reply-To: musl@lists.openwall.com

OK, so after a lot of research and discussion, I'm about to commit the first part of Big5 support in iconv. The plain "Big5" charset name is going to be the maximal set everybody agrees on, which, as far as I can tell, is just CP950. (Actually even IBM's Big5 variant in ICU differs from CP950 in a few places, but it's just wrong. The one ideograph where it differs conflicts with Unihan.txt, which is authoritative on which Unicode characters encode which character identities from historical CJK charsets.)
As for extensions, my understanding of HKSCS has now reached the point where I feel we can add it (charset name "Big5-HKSCS"), based on the 2008 government publication, which has a few new characters beyond the older versions found in most software. (Thank you nsz for helping dig up all the files and researching how they differ!) However, there are a few technical difficulties in implementing it: the Unicode codepoints span a 17-bit range rather than just a 16-bit range, so we need an efficient way of doing the mappings. What's worse, several HKSCS codepoints map to Latin characters with multiple combining marks which have no precomposed representation. Supporting these will require extending iconv to produce two or more characters of output for each unit of input, which is mildly error-prone with the current design, so I may hold off on HKSCS support until I overhaul some of the core logic of the iconv function.

Now, the hard part: Taiwan extensions. I appreciate all the help from Roy, but I'm still not at the point of having anything nearly ready to be added. The UAO extension set has at least 290 mappings to codepoints in the Unicode Private Use Areas (PUA), which makes it unsuitable for inclusion as-is. This may be a resolvable issue if these 290 characters all exist in Unicode and could be remapped, but I do not feel any of us are qualified to determine this. So UAO is probably still a long way from being adoptable into iconv, unless we have authoritative data on the identities of these characters, in the form of something from an official standards body or at the very least from multiple major vendors (enough to ensure that any future standardization would be consistent and non-controversial).

If there are other supersets of CP950 (possibly subsets of UAO) which would be useful for supporting Taiwanese users, I would very much like to understand the situation with them and whether it's feasible to include them in iconv at this point.
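To make two of the technical points above concrete, here is a minimal sketch in C. It is not the iconv code and not a proposed design; the names hkscs_seq, emit_seq and is_pua are made up for illustration, and the linear search is only there to keep the example short. The first part shows one input unit producing more than one code point of output, with an up-front space check so that nothing is written when the output does not fit (in the spirit of iconv's E2BIG behaviour). The four table entries are, to my knowledge, the Big5-HKSCS codes that need a base letter plus a combining mark. The second part is the kind of range test one could run over a candidate mapping table to count entries that land in the Private Use Areas.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one two-byte Big5-HKSCS code that maps to a
 * sequence of one or two Unicode code points (cp[1] == 0 means unused). */
struct hkscs_seq {
    uint16_t big5;
    uint32_t cp[2];
};

/* Codes with no single precomposed Unicode character (assumed list). */
static const struct hkscs_seq seqs[] = {
    { 0x8862, { 0x00CA, 0x0304 } }, /* E-circumflex + combining macron */
    { 0x8864, { 0x00CA, 0x030C } }, /* E-circumflex + combining caron  */
    { 0x88A3, { 0x00EA, 0x0304 } }, /* e-circumflex + combining macron */
    { 0x88A5, { 0x00EA, 0x030C } }, /* e-circumflex + combining caron  */
};

/* Write the code points for `big5` into out[0..outlen).
 * Returns the number written, 0 if the code is not in the table, or -1
 * if the output is too small. On -1 nothing has been written, so the
 * caller can retry the same input unit with a larger buffer. */
static int emit_seq(uint16_t big5, uint32_t *out, size_t outlen)
{
    for (size_t i = 0; i < sizeof seqs / sizeof seqs[0]; i++) {
        if (seqs[i].big5 != big5)
            continue;
        size_t n = seqs[i].cp[1] ? 2 : 1;
        if (outlen < n)
            return -1;
        for (size_t j = 0; j < n; j++)
            out[j] = seqs[i].cp[j];
        return (int)n;
    }
    return 0;
}

/* Nonzero if c lies in one of Unicode's three Private Use Areas:
 * the BMP PUA and the two supplementary PUA planes (15 and 16). */
static int is_pua(uint32_t c)
{
    return (c >= 0xE000 && c <= 0xF8FF)
        || (c >= 0xF0000 && c <= 0xFFFFD)
        || (c >= 0x100000 && c <= 0x10FFFD);
}
```

Checking the space before writing anything is the point of the first function: a conversion that has already emitted half of a sequence when it runs out of room cannot be restarted cleanly. Storing the code points as uint32_t sidesteps the 17-bit question here, at the cost of table size; a real table would presumably need something more compact.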
Perhaps the most important thing to know at this point is what practical difficulties exist for Taiwanese users limited to CP950, and which extension charsets could solve them. I feel that issues like "I cannot write my own name" are on a completely different level than "I can't mix Japanese text in with my legacy-encoded data [when I really should be using Unicode for this anyway]". The former sort of issue demands some sort of support, even if imperfect or mildly hackish, whereas the latter is not a justification for hacks and for bypassing proper standards processes.

Rich