From: Roy
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 05 Aug 2013 16:28:32 +0800
References: <20130804165152.GA32076@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

Since I'm a Traditional Chinese and Japanese legacy encoding user, I think I can say something here.

Mon, 05 Aug 2013 00:51:52 +0800, Rich Felker wrote:

> OK, so here's what I've found so far.
> Both legacy Korean and legacy
> Traditional Chinese encodings have essentially a single base character
> set:
>
>
> Traditional Chinese:
> Big5 (CP950)
> 89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
> All characters in BMP
> 27946 bytes table space
>
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
>
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
>

There is another Big5 extension called Big5-UAO, which is used on "ptt.cc", the world's largest telnet-based BBS. It has two tables, one mapping Big5-UAO to Unicode and the other mapping Unicode to Big5-UAO:

http://moztw.org/docs/big5/table/uao250-b2u.txt
http://moztw.org/docs/big5/table/uao250-u2b.txt

It extends the DBCS lead byte range down to 0x81.

> The big remaining questions are:
>
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).

Big5-UAO contains Japanese and Simplified Chinese characters that do not exist in the original MS-CP950 implementation.

>
> 2. Are there patterns to exploit? For Korean, ALL of the Hangul
> characters are actually combinations of several base letters.
> Unicode
> encodes them all sequentially in a pattern where the conversion to
> their constituent letters is purely algorithmic, but there seems to
> be no clean pattern in the legacy encodings, as the encodings started
> out just encoding the "important" ones then adding less important
> combinations in separate ranges.

In EUC-KR (MS-CP949), there are also Hanja characters (i.e. what Japanese calls Kanji) and Japanese Katakana/Hiragana in addition to the Hangul characters.

>
> Worst-case, adding Korean and Traditional Chinese tables will roughly
> double the size of iconv.o to around 150k. This will noticeably enlarge
> libc.so, but will make no difference to static-linked programs except
> those using iconv. I'm hoping we can make these additions less
> expensive, but I don't see a good way yet.

For static linking, can we have conditional linking like Qt does? When Qt is linked statically, it uses Q_IMPORT_PLUGIN to pull in the CJK codec tables:

    #ifndef QT_SHARED
    #include <QtPlugin>
    Q_IMPORT_PLUGIN(qcncodecs)
    Q_IMPORT_PLUGIN(qjpcodecs)
    Q_IMPORT_PLUGIN(qkrcodecs)
    Q_IMPORT_PLUGIN(qtwcodecs)
    #endif

>
> At some point, especially if the cost is not reduced, I will probably
> add build-time options to exclude a configurable subset of the
> supported character encodings. This would not be extremely
> fine-grained, and the choices to exclude would probably be just:
> Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
> 8-bit might also be an option but these are so small I can't think of
> cases where it would be beneficial to omit them (5k for the tables on
> top of the 2k of actual code in iconv). Perhaps if there are cases
> where iconv is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices, dropping the
> 8-bit tables and all of the support code could be useful; the
> resulting iconv would be around 1k, I think.
>
> Rich
>

HTH,
Roy
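
P.S. To make the table-size figures quoted above easier to check, here is a minimal sketch of how a (lead, trail) byte pair could be turned into an index into a dense table of 16-bit entries. The byte ranges are the ones from Rich's summary. All the function and macro names are made up for illustration, and this is only an assumption about one possible layout, not how musl's iconv actually stores its tables.

```c
#include <assert.h>
#include <stddef.h>

/* Trail-byte geometry from the figures quoted in the thread:
   40..7E gives 63 columns, A1..FE gives 94 columns, 157 in total.
   All names below are hypothetical, chosen for this sketch only. */
#define TRAIL_LOW_COUNT   63
#define TRAIL_HIGH_COUNT  94
#define TRAIL_COUNT       (TRAIL_LOW_COUNT + TRAIL_HIGH_COUNT)

/* Map a trail byte to its column 0..156, or -1 if it is not a valid trail. */
static int trail_column(unsigned char trail)
{
    if (trail >= 0x40 && trail <= 0x7E)
        return trail - 0x40;
    if (trail >= 0xA1 && trail <= 0xFE)
        return trail - 0xA1 + TRAIL_LOW_COUNT;
    return -1;
}

/* Index of (lead, trail) in a dense row-major table covering lead bytes
   lead_min..lead_max, or -1 if the pair falls outside the grid. */
static long dbcs_index(unsigned char lead, unsigned char trail,
                       unsigned char lead_min, unsigned char lead_max)
{
    int col = trail_column(trail);
    if (lead < lead_min || lead > lead_max || col < 0)
        return -1;
    return (long)(lead - lead_min) * TRAIL_COUNT + col;
}

/* Size in bytes of such a table when every entry is a 16-bit value. */
static size_t dbcs_table_bytes(unsigned char lead_min, unsigned char lead_max)
{
    return (size_t)(lead_max - lead_min + 1) * TRAIL_COUNT * 2;
}

/* Cross-check against the sizes quoted in the thread. */
static void dbcs_self_check(void)
{
    assert(dbcs_table_bytes(0xA1, 0xF9) == 27946); /* Big5: 89 lead bytes */
    assert(dbcs_table_bytes(0x88, 0xFE) == 37366); /* HKSCS: 119 lead bytes */
}
```

With these definitions, dbcs_index(0xA1, 0x40, 0xA1, 0xF9) is the first slot of the plain Big5 grid. The 1651 HKSCS characters outside the BMP would not fit in 16-bit entries, so they would still need the extra mapping Rich mentions. A dense table for Big5-UAO starting at lead byte 0x81 would be larger again under the same layout.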