From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3829
Path: news.gmane.org!not-for-mail
From: Roy <roytam@gmail.com>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 05 Aug 2013 16:28:32 +0800
Message-ID: <op.w1b4hubbdyj81a@monster.itedn32a.localdomain>
References: <20130804165152.GA32076@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
X-Trace: ger.gmane.org 1375691323 3622 80.91.229.3 (5 Aug 2013 08:28:43 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 5 Aug 2013 08:28:43 +0000 (UTC)
To: musl@lists.openwall.com
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/3829>

As a user of Traditional Chinese and Japanese legacy encodings, I think
I can add something here.

Mon, 05 Aug 2013 00:51:52 +0800, Rich Felker <dalias@aerifal.cx> wrote:

> OK, so here's what I've found so far. Both legacy Korean and legacy
> Traditional Chinese encodings have essentially a single base character
> set:
>

>
> Traditional Chinese:
> Big5 (CP950)
> 89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
> All characters in BMP
> 27946 bytes table space
>
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
>
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
>

There is another Big5 extension called Big5-UAO, which is used by
"ptt.cc", the world's largest telnet-based BBS.

It has two mapping tables, one from Big5-UAO to Unicode and one from
Unicode to Big5-UAO:
http://moztw.org/docs/big5/table/uao250-b2u.txt
http://moztw.org/docs/big5/table/uao250-u2b.txt

Big5-UAO extends the DBCS lead byte range down to 0x81.
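
To make the grid arithmetic quoted above concrete, here is a rough C
sketch. It is not musl code, and the function names are made up for
illustration. It assumes a flat table of 16-bit entries, and it assumes
that Big5-UAO keeps the same trail byte ranges as Big5 (40-7E, A1-FE),
which should be checked against the tables linked above.

```c
#include <stddef.h>

/* Trail bytes per lead byte: 0x40-0x7E (63) plus 0xA1-0xFE (94). */
#define BIG5_COLS (63 + 94)

/* Map a trail byte to its column 0..156, or -1 if it is not valid. */
int big5_col(unsigned char trail)
{
	if (trail >= 0x40 && trail <= 0x7E) return trail - 0x40;
	if (trail >= 0xA1 && trail <= 0xFE) return trail - 0xA1 + 63;
	return -1;
}

/* Index into a flat table whose first row is lead byte lead_min.
 * Returns -1 for byte pairs outside the grid. */
long big5_index(unsigned char lead, unsigned char trail,
                unsigned char lead_min, unsigned char lead_max)
{
	int col = big5_col(trail);
	if (lead < lead_min || lead > lead_max || col < 0) return -1;
	return (long)(lead - lead_min) * BIG5_COLS + col;
}

/* Bytes needed for a 16-bit table covering lead_min..lead_max. */
size_t big5_table_bytes(unsigned char lead_min, unsigned char lead_max)
{
	return (size_t)(lead_max - lead_min + 1) * BIG5_COLS * 2;
}
```

With this, big5_table_bytes(0xA1, 0xF9) is 27946 and
big5_table_bytes(0x88, 0xFE) is 37366, matching the figures quoted
above. Under the same assumptions, lead bytes 0x81-0xFE would need
126 * 157 * 2 = 39564 bytes.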

> The big remaining questions are:
>
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).

Big5-UAO contains Japanese and Simplified Chinese characters that do
not exist in the original MS-CP950 implementation.

>
> 2. Are there patterns to exploit? For Korean, ALL of the Hangul
> characters are actually combinations of several base letters. Unicode
> encodes them all sequentially in a pattern where the conversion to
> their constituent letters is purely algorithmic, but there seems to
> be no clean pattern in the legacy encodings, as the encodings started
> out just encoding the "important" ones then adding less important
> combinations in separate ranges.

Besides Hangul, EUC-KR (MS-CP949) also contains Hanja characters (the
equivalent of Kanji in Japanese) and Japanese Katakana/Hiragana.

>
> Worst-case, adding Korean and Traditional Chinese tables will roughly
> double the size of iconv.o to around 150k. This will noticeably enlarge
> libc.so, but will make no difference to static-linked programs except
> those using iconv. I'm hoping we can make these additions less
> expensive, but I don't see a good way yet.

For static linking, could we have conditional linking like Qt does?
In a static Qt build, Q_IMPORT_PLUGIN is used to pull in the CJK codec
tables:

#ifndef QT_SHARED
     #include <QtPlugin>

     Q_IMPORT_PLUGIN(qcncodecs)
     Q_IMPORT_PLUGIN(qjpcodecs)
     Q_IMPORT_PLUGIN(qkrcodecs)
     Q_IMPORT_PLUGIN(qtwcodecs)
#endif


>
> At some point, especially if the cost is not reduced, I will probably
> add build-time options to exclude a configurable subset of the
> supported character encodings. This would not be extremely
> fine-grained, and the choices to exclude would probably be just:
> Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
> 8-bit might also be an option but these are so small I can't think of
> cases where it would be beneficial to omit them (5k for the tables on
> top of the 2k of actual code in iconv). Perhaps if there are cases
> where iconv is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices, dropping the
> 8-bit tables and all of the support code could be useful; the
> resulting iconv would be around 1k, I think.
>
> Rich
>
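
A coarse build-time switch of the kind described above might look
roughly like the following in C. This is only a sketch: the
ICONV_NO_* macro names and charset_enabled() are hypothetical, not
anything that exists in musl, and a real implementation would guard the
mapping tables themselves rather than just a name list.

```c
#include <string.h>

/* Hypothetical switches, set at build time, e.g. cc -DICONV_NO_KOREAN */
static const char *const charset_names[] = {
	"UTF-8", "UTF-16LE", "ISO-8859-1",
#ifndef ICONV_NO_JAPANESE
	"SHIFT_JIS", "EUC-JP",
#endif
#ifndef ICONV_NO_SIMPLIFIED_CHINESE
	"GB2312", "GBK",
#endif
#ifndef ICONV_NO_TRADITIONAL_CHINESE
	"BIG5", "BIG5-HKSCS",
#endif
#ifndef ICONV_NO_KOREAN
	"EUC-KR",
#endif
};

/* Return 1 if the named charset was compiled in, 0 otherwise. */
int charset_enabled(const char *name)
{
	size_t i;
	for (i = 0; i < sizeof charset_names / sizeof *charset_names; i++)
		if (!strcmp(charset_names[i], name)) return 1;
	return 0;
}
```

Built without any of the macros defined, every group is included;
defining ICONV_NO_KOREAN would drop only the Korean entries.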

HTH,
Roy