From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3812
Path: news.gmane.org!not-for-mail
From: Rich Felker <dalias@aerifal.cx>
Newsgroups: gmane.linux.lib.musl.general
Subject: iconv Korean and Traditional Chinese research so far
Date: Sun, 4 Aug 2013 12:51:52 -0400
Message-ID: <20130804165152.GA32076@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1375635125 15906 80.91.229.3 (4 Aug 2013 16:52:05 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sun, 4 Aug 2013 16:52:05 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-3816-gllmg-musl=m.gmane.org@lists.openwall.com Sun Aug 04 18:52:08 2013
Return-path: <musl-return-3816-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-3816-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1V61XK-0003ie-G7
	for gllmg-musl@plane.gmane.org; Sun, 04 Aug 2013 18:52:06 +0200
Original-Received: (qmail 20465 invoked by uid 550); 4 Aug 2013 16:52:04 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 20453 invoked from network); 4 Aug 2013 16:52:04 -0000
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.linux.lib.musl.general:3812
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/3812>

OK, so here's what I've found so far. Both legacy Korean and legacy
Traditional Chinese encodings have essentially a single base character
set:

Korean:
KS X 1001 (previously known as KS C 5601)
93 x 94 DBCS grid (A1-FD A1-FE)
All characters in BMP
17484 bytes table space

Traditional Chinese:
Big5 (CP950)
89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
All characters in BMP
27946 bytes table space

Both of these have various minor extensions, but the main extensions
of any relevance seem to be:

Korean:
CP949
Lead byte range is extended to 81-FD (125)
Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
44500 bytes table space

Traditional Chinese:
HKSCS (CP951)
Lead byte range is extended to 88-FE (119)
1651 characters outside BMP
37366 bytes table space for 16-bit mapping table, plus extra mapping
needed for characters outside BMP

The big remaining questions are:

1. How important are these extensions? I would guess the answer is
"fairly important", espectially for HKSCS where I believe the
additional characters are needed for encoding Cantonese words, but
it's less clear to me whether the Korean extensions are useful (they
seem to mainly be for the sake of completeness representing most/all
possible theoretical syllables that don't actually occur in words, but
this may be a naive misunderstanding on my part).

2. Are there patterns to exploit? For Korean, ALL of the Hangul
characters are actually combinations of several base letters. Unicode
encodes them all sequentially in a pattern where the conversion to
their constitutent letters is purely algorithmic, but there seems to
be no clean pattern in the legacy encodings, as the encodings started
out just incoding the "important" ones then adding less important
combinations in separate ranges.

Worst-case, adding Korean and Traditional Chinese tables will roughly
double the size of iconv.o to around 150k. This will noticably enlarge
libc.so, but will make no difference to static-linked programs except
those using iconv. I'm hoping we can make these additions less
expensive, but I don't see a good way yet.

At some point, especially if the cost is not reduced, I will probably
add build-time options to exclude a configurable subset of the
supported character encodings. This would not be extremely
fine-grained, and the choices to exclude would probably be just:
Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
8-bit might also be an option but these are so small I can't think of
cases where it would be beneficial to omit them (5k for the tables on
top of the 2k of actual code in iconv). Perhaps if there are cases
where iconv is needed purely for conversion between different Unicode
forms, but no legacy charsets, on tiny embedded devices, dropping the
8-bit tables and all of the support code could be useful; the
resulting iconv would be around 1k, I think.

Rich