From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: iconv Korean and Traditional Chinese research so far
Date: Sun, 4 Aug 2013 12:51:52 -0400
Message-ID: <20130804165152.GA32076@brightrain.aerifal.cx>
To: musl@lists.openwall.com

OK, so here's what I've found so far.
Both legacy Korean and legacy Traditional Chinese encodings have
essentially a single base character set:

Korean:
KS X 1001 (previously known as KS C 5601)
93 x 94 DBCS grid (A1-FD A1-FE)
All characters in BMP
17484 bytes table space

Traditional Chinese:
Big5 (CP950)
89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
All characters in BMP
27946 bytes table space

Both of these have various minor extensions, but the main extensions
of any relevance seem to be:

Korean:
CP949
Lead byte range is extended to 81-FD (125)
Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
44500 bytes table space

Traditional Chinese:
HKSCS (CP951)
Lead byte range is extended to 88-FE (119)
1651 characters outside BMP
37366 bytes table space for 16-bit mapping table, plus extra mapping
needed for characters outside BMP

The big remaining questions are:

1. How important are these extensions? I would guess the answer is
"fairly important", especially for HKSCS, where I believe the
additional characters are needed for encoding Cantonese words. It's
less clear to me whether the Korean extensions are useful: they seem
to exist mainly for completeness, representing most or all
theoretically possible syllables, including ones that don't actually
occur in words. This may be a naive misunderstanding on my part.

2. Are there patterns to exploit? For Korean, ALL of the Hangul
characters are actually combinations of several base letters. Unicode
encodes them all sequentially in a pattern where the conversion to
their constituent letters is purely algorithmic, but there seems to be
no clean pattern in the legacy encodings, as the encodings started out
just encoding the "important" ones and then added less important
combinations in separate ranges.

Worst-case, adding Korean and Traditional Chinese tables will roughly
double the size of iconv.o to around 150k. This will noticeably
enlarge libc.so, but will make no difference to static-linked programs
except those using iconv.
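The table-space figures quoted for the four grids can be reproduced
from the lead and tail byte ranges alone. The sketch below assumes a
flat table with one 16-bit entry per grid cell; that layout is an
assumption on my part, chosen because it matches the numbers given.
The function names are illustrative and not taken from any existing
code.

```c
#include <stddef.h>

/* Count of byte values in the inclusive range [lo, hi]. */
static size_t span(unsigned lo, unsigned hi)
{
	return (size_t)(hi - lo) + 1;
}

/* Bytes for a flat lead x tail grid with one 16-bit entry per cell
 * (assumed layout; it reproduces the figures quoted in the message). */
static size_t grid_bytes(size_t leads, size_t tails)
{
	return leads * tails * 2;
}

/* KS X 1001: leads A1-FD, tails A1-FE. */
size_t ksx1001_bytes(void)
{
	return grid_bytes(span(0xA1, 0xFD), span(0xA1, 0xFE));
}

/* Big5: leads A1-F9, tails 40-7E and A1-FE. */
size_t big5_bytes(void)
{
	return grid_bytes(span(0xA1, 0xF9),
	                  span(0x40, 0x7E) + span(0xA1, 0xFE));
}

/* CP949: leads 81-FD, tails 41-5A, 61-7A and 81-FE. */
size_t cp949_bytes(void)
{
	return grid_bytes(span(0x81, 0xFD),
	                  span(0x41, 0x5A) + span(0x61, 0x7A) +
	                  span(0x81, 0xFE));
}

/* HKSCS: leads 88-FE, same tail ranges as Big5. */
size_t hkscs_bytes(void)
{
	return grid_bytes(span(0x88, 0xFE),
	                  span(0x40, 0x7E) + span(0xA1, 0xFE));
}
```

Under that assumption the two extended grids together come to
44500 + 37366 = 81866 bytes, before any extra mapping for the
characters outside the BMP.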
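For reference, the algorithmic layout on the Unicode side of
question 2 looks like the following. This is a minimal sketch of the
standard arithmetic for precomposed Hangul syllables (U+AC00 through
U+D7A3: 19 leading consonants, 21 vowels, 28 trailing slots including
"none"). The struct and function names are made up for illustration.
It says nothing about the legacy side, which is where the tables are
needed.

```c
#include <stdint.h>

enum {
	S_BASE  = 0xAC00,            /* first precomposed syllable */
	L_COUNT = 19,                /* leading consonants */
	V_COUNT = 21,                /* vowels */
	T_COUNT = 28,                /* trailing consonants, 0 = none */
	N_COUNT = V_COUNT * T_COUNT, /* syllables per leading consonant */
	S_COUNT = L_COUNT * N_COUNT  /* 11172 syllables in total */
};

/* Zero-based indices of the constituent letters of one syllable. */
struct jamo_index {
	unsigned l, v, t;
};

/* Split a precomposed syllable into letter indices.
 * Returns 0 on success, -1 if c is not a precomposed Hangul syllable. */
int hangul_decompose(uint32_t c, struct jamo_index *out)
{
	if (c < (uint32_t)S_BASE || c >= (uint32_t)S_BASE + S_COUNT)
		return -1;
	uint32_t s = c - S_BASE;
	out->l = s / N_COUNT;
	out->v = s % N_COUNT / T_COUNT;
	out->t = s % T_COUNT;
	return 0;
}

/* Inverse mapping: letter indices back to the syllable code point.
 * The caller must pass indices within the counts above. */
uint32_t hangul_compose(unsigned l, unsigned v, unsigned t)
{
	return S_BASE + (l * V_COUNT + v) * T_COUNT + t;
}
```

A legacy-to-Unicode table would still have to record which syllable
sits in each grid cell; the arithmetic only helps once a code point or
an (l, v, t) triple is in hand.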
I'm hoping we can make these additions less expensive, but I don't see
a good way yet.

At some point, especially if the cost is not reduced, I will probably
add build-time options to exclude a configurable subset of the
supported character encodings. This would not be extremely
fine-grained, and the choices to exclude would probably be just:
Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
8-bit might also be an option, but these are so small I can't think of
cases where it would be beneficial to omit them (5k for the tables on
top of the 2k of actual code in iconv). Perhaps if there are cases
where iconv is needed purely for conversion between different Unicode
forms, but no legacy charsets, on tiny embedded devices, dropping the
8-bit tables and all of the support code could be useful; the
resulting iconv would be around 1k, I think.

Rich
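One possible shape for the coarse build-time exclusion described
above, as a sketch only. The ICONV_NO_* macros, the WANT_* helpers and
charset_group_enabled() are all hypothetical names invented here; none
of them are existing musl options or interfaces.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical switches, not existing musl options. Passing e.g.
 * -DICONV_NO_KOREAN to the compiler would drop that group. */
#ifdef ICONV_NO_JAPANESE
#define WANT_JAPANESE 0
#else
#define WANT_JAPANESE 1
#endif

#ifdef ICONV_NO_SIMPLIFIED_CHINESE
#define WANT_SIMPLIFIED_CHINESE 0
#else
#define WANT_SIMPLIFIED_CHINESE 1
#endif

#ifdef ICONV_NO_TRADITIONAL_CHINESE
#define WANT_TRADITIONAL_CHINESE 0
#else
#define WANT_TRADITIONAL_CHINESE 1
#endif

#ifdef ICONV_NO_KOREAN
#define WANT_KOREAN 0
#else
#define WANT_KOREAN 1
#endif

/* One entry per excludable group, matching the four choices listed. */
static const struct {
	const char *name;
	int enabled;
} groups[] = {
	{ "japanese",            WANT_JAPANESE },
	{ "simplified-chinese",  WANT_SIMPLIFIED_CHINESE },
	{ "traditional-chinese", WANT_TRADITIONAL_CHINESE },
	{ "korean",              WANT_KOREAN },
};

/* Returns 1 if the group is compiled in, 0 if excluded,
 * -1 if the name is not one of the known groups. */
int charset_group_enabled(const char *name)
{
	for (size_t i = 0; i < sizeof groups / sizeof *groups; i++)
		if (!strcmp(groups[i].name, name))
			return groups[i].enabled;
	return -1;
}
```

In a real converter the same WANT_* tests would presumably wrap the
mapping tables and their decode cases, so that excluded data never
reaches the object file.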