From: Roy <roytam@gmail.com>
To: musl@lists.openwall.com
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 05 Aug 2013 16:28:32 +0800
Message-ID: <op.w1b4hubbdyj81a@monster.itedn32a.localdomain>
In-Reply-To: <20130804165152.GA32076@brightrain.aerifal.cx>
Since I'm a Traditional Chinese and Japanese legacy encoding user, I think
I can say something here.
Mon, 05 Aug 2013 00:51:52 +0800, Rich Felker <dalias@aerifal.cx> wrote:
> OK, so here's what I've found so far. Both legacy Korean and legacy
> Traditional Chinese encodings have essentially a single base character
> set:
>
>
> Traditional Chinese:
> Big5 (CP950)
> 89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
> All characters in BMP
> 27946 bytes table space
>
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
>
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
>
There is another Big5 extension called Big5-UAO, which is used on
"ptt.cc", the world's largest telnet-based BBS.
It extends the DBCS lead byte range down to 0x81 and comes as two tables,
one mapping Big5-UAO to Unicode and the other mapping Unicode to Big5-UAO:
http://moztw.org/docs/big5/table/uao250-b2u.txt
http://moztw.org/docs/big5/table/uao250-u2b.txt
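As a rough check of the table sizes quoted above, and of what a lead byte range starting at 0x81 would cost, here is a minimal C sketch. It assumes a flat 16-bit table and that Big5-UAO keeps the same trail byte ranges as Big5 (40-7E, A1-FE); the function names are made up for illustration and are not musl code.

```c
#include <assert.h>

/* Trail bytes per lead byte: 0x40-0x7E (63) plus 0xA1-0xFE (94). */
#define TRAIL_COUNT (63 + 94)

/* Size in bytes of a flat 16-bit table covering lead bytes lo..hi. */
unsigned table_bytes(unsigned lo, unsigned hi)
{
    assert(lo <= hi);
    return (hi - lo + 1) * TRAIL_COUNT * 2;
}

/* Index of (lead, trail) in such a table whose first lead byte is
 * lead_lo, or -1 if the trail byte is outside both trail ranges. */
int dbcs_index(unsigned lead_lo, unsigned lead, unsigned trail)
{
    unsigned col;

    assert(lead >= lead_lo);
    if (trail >= 0x40 && trail <= 0x7E)
        col = trail - 0x40;          /* first trail range: columns 0..62 */
    else if (trail >= 0xA1 && trail <= 0xFE)
        col = trail - 0xA1 + 63;     /* second trail range: columns 63..156 */
    else
        return -1;
    return (int)((lead - lead_lo) * TRAIL_COUNT + col);
}
```

table_bytes(0xA1, 0xF9) gives 27946 and table_bytes(0x88, 0xFE) gives 37366, matching the figures quoted above; under the stated assumption, table_bytes(0x81, 0xFE) gives 39564.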
> The big remaining questions are:
>
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).
Big5-UAO contains Japanese and Simplified Chinese characters which
do not exist in the original MS-CP950 implementation.
>
> 2. Are there patterns to exploit? For Korean, ALL of the Hangul
> characters are actually combinations of several base letters. Unicode
> encodes them all sequentially in a pattern where the conversion to
> their constituent letters is purely algorithmic, but there seems to
> be no clean pattern in the legacy encodings, as the encodings started
> out just encoding the "important" ones then adding less important
> combinations in separate ranges.
In EUC-KR (MS-CP949), there are Hanja characters (i.e. what Japanese calls
Kanji) and Japanese Katakana/Hiragana in addition to Hangul characters.
>
> Worst-case, adding Korean and Traditional Chinese tables will roughly
> double the size of iconv.o to around 150k. This will noticeably enlarge
> libc.so, but will make no difference to static-linked programs except
> those using iconv. I'm hoping we can make these additions less
> expensive, but I don't see a good way yet.
For static linking, can we have conditional linking like Qt does?
When Qt is linked statically, Q_IMPORT_PLUGIN is used to pull in the CJK
codec tables:
#ifndef QT_SHARED
#include <QtPlugin>
Q_IMPORT_PLUGIN(qcncodecs)
Q_IMPORT_PLUGIN(qjpcodecs)
Q_IMPORT_PLUGIN(qkrcodecs)
Q_IMPORT_PLUGIN(qtwcodecs)
#endif
>
> At some point, especially if the cost is not reduced, I will probably
> add build-time options to exclude a configurable subset of the
> supported character encodings. This would not be extremely
> fine-grained, and the choices to exclude would probably be just:
> Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
> 8-bit might also be an option but these are so small I can't think of
> cases where it would be beneficial to omit them (5k for the tables on
> top of the 2k of actual code in iconv). Perhaps if there are cases
> where iconv is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices, dropping the
> 8-bit tables and all of the support code could be useful; the
> resulting iconv would be around 1k, I think.
>
> Rich
>
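One way the build-time exclusion described above could look in C is sketched below. This is only an illustration under assumptions: the NO_ICONV_* macro names, the struct and the one-entry placeholder tables are all hypothetical and are not musl's actual iconv code or options.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical build switches: compile with e.g. -DNO_ICONV_KOREAN
 * to leave a whole group of tables out of the object file. */
struct charset {
    const char *name;
    const unsigned short *table;  /* NULL for table-free encodings */
};

#ifndef NO_ICONV_TRAD_CHINESE
static const unsigned short big5_table[1] = { 0 };   /* placeholder data */
#endif
#ifndef NO_ICONV_KOREAN
static const unsigned short euckr_table[1] = { 0 };  /* placeholder data */
#endif

static const struct charset charsets[] = {
    { "utf-8", NULL },
#ifndef NO_ICONV_TRAD_CHINESE
    { "big5", big5_table },
#endif
#ifndef NO_ICONV_KOREAN
    { "euc-kr", euckr_table },
#endif
};

/* Look up a charset by exact (case-sensitive) name; NULL if the name
 * is unknown or its group was excluded at build time. */
const struct charset *find_charset(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof charsets / sizeof charsets[0]; i++)
        if (strcmp(charsets[i].name, name) == 0)
            return &charsets[i];
    return NULL;
}
```

With no switches defined, all three names resolve; building with -DNO_ICONV_KOREAN would make find_charset("euc-kr") return NULL while leaving the others in place.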
HTH,
Roy