From: Rich Felker
To: musl@lists.openwall.com
Reply-To: musl@lists.openwall.com
Subject: Re: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 11:43:45 -0400
Message-ID: <20130805154344.GJ221@brightrain.aerifal.cx>
References: <20130804165152.GA32076@brightrain.aerifal.cx>

On Mon, Aug 05, 2013 at 04:28:32PM +0800, Roy wrote:
> Since I'm a Traditional Chinese and Japanese legacy encoding user, I
> think I can say something here.

Great, thanks for joining in with some constructive input!
:)

> >Traditional Chinese:
> >HKSCS (CP951)
> >Lead byte range is extended to 88-FE (119)
> >1651 characters outside BMP
> >37366 bytes table space for 16-bit mapping table, plus extra mapping
> >needed for characters outside BMP
>
> There is another Big5 extension called Big5-UAO, which is being used
> in world's largest telnet-based BBS called "ptt.cc".
>
> It has two tables, one for Big5-UAO to Unicode, another one is
> Unicode to Big5-UAO.
> http://moztw.org/docs/big5/table/uao250-b2u.txt
> http://moztw.org/docs/big5/table/uao250-u2b.txt
>
> Which extends DBCS lead byte to 0x81.

Is it a superset of HKSCS or does it assign different characters to the
range covered by HKSCS?

> In EUC-KR (MS-CP949), there is Hanja characters (i.e. Kanji
> characters in Japanese) and Japanese Katakana/Hiragana besides of
> Hangul characters.

Yes, I'm aware of these. However, it looks to me like the only
characters outside the standard 94x94 grid zone are Hangul syllables,
and they appear in codepoint order. If so, even if there's not a good
pattern to where they're located, merely knowing that the ones that are
missing from the 94x94 grid are placed in order in the expanded space
is sufficient to perform algorithmic (albeit inefficient) conversion.
Does this sound correct?

> >Worst-case, adding Korean and Traditional Chinese tables will roughly
> >double the size of iconv.o to around 150k. This will noticeably enlarge
> >libc.so, but will make no difference to static-linked programs except
> >those using iconv. I'm hoping we can make these additions less
> >expensive, but I don't see a good way yet.
>
> For static linking, can we have conditional linking like QT does?

My feeling is that it's a tradeoff, and probably has more cons than
pros. Unlike QT, musl's iconv is extremely small. Even with all the
above, the size of iconv.o will be under 130k, maybe closer to 110k. If
you actually use iconv in your program, this is a small price to pay
for having it fully functional.

On the other hand, if linking it is conditional, you have to consider
who makes the decision, and when. If it's at link time for each
application, that's probably too much of a musl-specific version. If
it's at build time for musl, then is it your device vendor deciding for
you what languages you need? One of the biggest headaches of
uClibc-based systems is finding that the system libc was built with
important options you need turned off and that you need to hack in a
replacement to get something working...

I think the cost of getting stuck with broken binaries where charsets
were omitted is sufficiently greater than the cost of adding a few tens
of kb to static binaries using iconv, that we should only consider a
build time option if embedded users are actively reporting size
problems.

Rich
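
A minimal sketch of the algorithmic conversion idea raised above for
the Hangul syllables outside the 94x94 grid, assuming the premise holds
(the syllables missing from the grid fill the expanded space in
codepoint order). This is not musl's iconv code. The names
ext_index_to_unicode, unicode_to_ext_index and in_grid_fn are
hypothetical, and toy_in_grid is a stand-in for a real lookup in the
94x94 table. Turning a two-byte code into the ordinal n is left out,
since that depends on the layout of the expanded space.

```c
#include <assert.h>
#include <stddef.h>

#define HANGUL_FIRST 0xAC00u  /* first precomposed Hangul syllable */
#define HANGUL_LAST  0xD7A3u  /* last precomposed Hangul syllable */

/* Hypothetical predicate: nonzero if syllable c already has a slot in
 * the standard 94x94 grid. A real one would consult the grid table. */
typedef int (*in_grid_fn)(unsigned c);

/* Return the n-th (0-based) Hangul syllable, in codepoint order, that
 * is NOT in the grid, or 0 if n is out of range. Linear scan: small
 * and table-free, but slow, matching the "inefficient" caveat. */
unsigned ext_index_to_unicode(unsigned n, in_grid_fn in_grid)
{
    assert(in_grid != NULL);
    for (unsigned c = HANGUL_FIRST; c <= HANGUL_LAST; c++) {
        if (in_grid(c)) continue;   /* already covered by the grid */
        if (n-- == 0) return c;     /* reached the requested ordinal */
    }
    return 0;
}

/* Inverse direction: return the 0-based ordinal of c among the
 * syllables missing from the grid, or -1 if c is not one of them. */
long unicode_to_ext_index(unsigned c, in_grid_fn in_grid)
{
    assert(in_grid != NULL);
    if (c < HANGUL_FIRST || c > HANGUL_LAST || in_grid(c)) return -1;
    long n = 0;
    for (unsigned k = HANGUL_FIRST; k < c; k++)
        if (!in_grid(k)) n++;       /* count earlier missing syllables */
    return n;
}

/* Toy stand-in, for demonstration only: pretends that every even
 * codepoint is in the grid. This is NOT the real grid contents. */
int toy_in_grid(unsigned c)
{
    return (c & 1) == 0;
}
```

With the toy predicate, ext_index_to_unicode(0, toy_in_grid) yields
0xAC01 and unicode_to_ext_index(0xAC05, toy_in_grid) yields 2. The
scan trades speed for size: it needs no table beyond the one that
already describes the 94x94 grid.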