iconv Korean and Traditional Chinese research so far

mailing list of musl libc
 help / color / mirror / code / Atom feed

From: Rich Felker <dalias@aerifal.cx>
To: musl@lists.openwall.com
Subject: iconv Korean and Traditional Chinese research so far
Date: Sun, 4 Aug 2013 12:51:52 -0400	[thread overview]
Message-ID: <20130804165152.GA32076@brightrain.aerifal.cx> (raw)

OK, so here's what I've found so far. Both legacy Korean and legacy
Traditional Chinese encodings have essentially a single base character
set:

Korean:
KS X 1001 (previously known as KS C 5601)
93 x 94 DBCS grid (A1-FD A1-FE)
All characters in BMP
17484 bytes table space

Traditional Chinese:
Big5 (CP950)
89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
All characters in BMP
27946 bytes table space

Both of these have various minor extensions, but the main extensions
of any relevance seem to be:

Korean:
CP949
Lead byte range is extended to 81-FD (125)
Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
44500 bytes table space

Traditional Chinese:
HKSCS (CP951)
Lead byte range is extended to 88-FE (119)
1651 characters outside BMP
37366 bytes table space for 16-bit mapping table, plus extra mapping
needed for characters outside BMP

The big remaining questions are:

1. How important are these extensions? I would guess the answer is
"fairly important", espectially for HKSCS where I believe the
additional characters are needed for encoding Cantonese words, but
it's less clear to me whether the Korean extensions are useful (they
seem to mainly be for the sake of completeness representing most/all
possible theoretical syllables that don't actually occur in words, but
this may be a naive misunderstanding on my part).

2. Are there patterns to exploit? For Korean, ALL of the Hangul
characters are actually combinations of several base letters. Unicode
encodes them all sequentially in a pattern where the conversion to
their constitutent letters is purely algorithmic, but there seems to
be no clean pattern in the legacy encodings, as the encodings started
out just incoding the "important" ones then adding less important
combinations in separate ranges.

Worst-case, adding Korean and Traditional Chinese tables will roughly
double the size of iconv.o to around 150k. This will noticably enlarge
libc.so, but will make no difference to static-linked programs except
those using iconv. I'm hoping we can make these additions less
expensive, but I don't see a good way yet.

At some point, especially if the cost is not reduced, I will probably
add build-time options to exclude a configurable subset of the
supported character encodings. This would not be extremely
fine-grained, and the choices to exclude would probably be just:
Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
8-bit might also be an option but these are so small I can't think of
cases where it would be beneficial to omit them (5k for the tables on
top of the 2k of actual code in iconv). Perhaps if there are cases
where iconv is needed purely for conversion between different Unicode
forms, but no legacy charsets, on tiny embedded devices, dropping the
8-bit tables and all of the support code could be useful; the
resulting iconv would be around 1k, I think.

Rich

next             reply	other threads:[~2013-08-04 16:51 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-04 16:51 Rich Felker [this message]
2013-08-04 22:39 ` Harald Becker
2013-08-05  0:44   ` Szabolcs Nagy
2013-08-05  1:24     ` Harald Becker
2013-08-05  3:13       ` Szabolcs Nagy
2013-08-05  7:03         ` Harald Becker
2013-08-05 12:54           ` Rich Felker
2013-08-05  0:49   ` Rich Felker
2013-08-05  1:53     ` Harald Becker
2013-08-05  3:39       ` Rich Felker
2013-08-05  7:53         ` Harald Becker
2013-08-05  8:24           ` Justin Cormack
2013-08-05 14:43             ` Rich Felker
2013-08-05 14:35           ` Rich Felker
2013-08-05  0:46 ` Harald Becker
2013-08-05  5:00 ` Rich Felker
2013-08-05  8:28 ` Roy
2013-08-05 15:43   ` Rich Felker
2013-08-05 17:31     ` Rich Felker
2013-08-05 19:12   ` Rich Felker
2013-08-06  6:14     ` Roy
2013-08-06 13:32       ` Rich Felker
2013-08-06 15:11         ` Roy
2013-08-06 16:22           ` Rich Felker
2013-08-07  0:54             ` Roy
2013-08-07  7:20               ` Roy
     [not found] <20130804232816.dc30d64f61e5ec441c34ffd4f788e58e.313eb9eea8.wbe@email22.secureserver.net>
2013-08-05 12:46 ` Rich Felker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130804165152.GA32076@brightrain.aerifal.cx \
    --to=dalias@aerifal.cx \
    --cc=musl@lists.openwall.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).