mailing list of musl libc
From: Rich Felker <dalias@aerifal.cx>
To: musl@lists.openwall.com
Subject: Re: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 11:43:45 -0400
Message-ID: <20130805154344.GJ221@brightrain.aerifal.cx>
In-Reply-To: <op.w1b4hubbdyj81a@monster.itedn32a.localdomain>

On Mon, Aug 05, 2013 at 04:28:32PM +0800, Roy wrote:
> Since I'm a Traditional Chinese and Japanese legacy encoding user, I
> think I can say something here.

Great, thanks for joining in with some constructive input! :)

> >Traditional Chinese:
> >HKSCS (CP951)
> >Lead byte range is extended to 88-FE (119)
> >1651 characters outside BMP
> >37366 bytes table space for 16-bit mapping table, plus extra mapping
> >needed for characters outside BMP
> 
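
As a sanity check on the 37366-byte figure quoted above, here is a minimal
sketch of the arithmetic. It assumes the usual Big5 trail byte ranges
(0x40-0x7E and 0xA1-0xFE), which are not stated in this thread, and the helper
names are made up for illustration only; this is not code from musl.

```c
#include <stddef.h>

/* Number of byte values in the inclusive range [lo, hi]. */
static size_t span(unsigned lo, unsigned hi)
{
	return (size_t)hi - lo + 1;
}

/* ASSUMPTION: standard Big5 trail bytes are 0x40-0x7E and 0xA1-0xFE. */
static size_t big5_trails(void)
{
	return span(0x40, 0x7E) + span(0xA1, 0xFE);
}

/* Size in bytes of a dense lead x trail table of 16-bit entries. */
static size_t table_bytes(size_t leads, size_t trails)
{
	return leads * trails * 2;
}
```

With lead bytes 0x88-0xFE, span(0x88, 0xFE) gives the 119 lead bytes and
table_bytes(119, big5_trails()) gives the 16-bit table size; characters
outside the BMP would need extra space on top of that.
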
> There is another Big5 extension called Big5-UAO, which is used by
> the world's largest telnet-based BBS, "ptt.cc".
> 
> It has two tables, one mapping Big5-UAO to Unicode and another
> mapping Unicode to Big5-UAO.
> http://moztw.org/docs/big5/table/uao250-b2u.txt
> http://moztw.org/docs/big5/table/uao250-u2b.txt
> 
> It extends the DBCS lead byte range down to 0x81.

Is it a superset of HKSCS or does it assign different characters to
the range covered by HKSCS?

> In EUC-KR (MS-CP949), there are Hanja characters (i.e. Kanji
> characters in Japanese) and Japanese Katakana/Hiragana besides the
> Hangul characters.

Yes, I'm aware of these. However, it looks to me like the only
characters outside the standard 94x94 grid zone are Hangul syllables,
and they appear in codepoint order. If so, even if there's not a good
pattern to where they're located, merely knowing that the ones that
are missing from the 94x94 grid are placed in order in the expanded
space is sufficient to perform algorithmic (albeit inefficient)
conversion. Does this sound correct?
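
To make the idea concrete, here is a minimal sketch of the "n-th missing
syllable" lookup. The toy_grid array is HYPOTHETICAL toy data standing in for
the real set of Hangul syllables in the 94x94 grid, and the function names are
made up for illustration. Mapping the resulting index to actual byte pairs in
the expanded space would still need the layout of that space, which is left
out here.

```c
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL toy data: a few syllables standing in for the full set
 * of Hangul syllables present in the 94x94 grid. Sorted ascending. */
static const uint16_t toy_grid[] = { 0xAC00, 0xAC01, 0xAC04, 0xAC07 };
#define TOY_GRID_LEN (sizeof toy_grid / sizeof toy_grid[0])

/* Binary search: return 1 if c is in the (toy) grid set, else 0. */
static int in_grid(uint32_t c)
{
	size_t lo = 0, hi = TOY_GRID_LEN;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (toy_grid[mid] == c) return 1;
		if (toy_grid[mid] < c) lo = mid + 1;
		else hi = mid;
	}
	return 0;
}

/* Return the n-th (0-based) Hangul syllable missing from the grid,
 * walking U+AC00..U+D7A3 in codepoint order; 0 if there is none. */
static uint32_t nth_extended(unsigned n)
{
	uint32_t c;
	for (c = 0xAC00; c <= 0xD7A3; c++) {
		if (in_grid(c)) continue;
		if (!n--) return c;
	}
	return 0;
}

/* Reverse direction: position of c among the missing syllables,
 * or -1 if c is not a Hangul syllable or is already in the grid. */
static long extended_index(uint32_t c)
{
	uint32_t d;
	long n = 0;
	if (c < 0xAC00 || c > 0xD7A3 || in_grid(c)) return -1;
	for (d = 0xAC00; d < c; d++)
		if (!in_grid(d)) n++;
	return n;
}
```

Both directions are linear scans over the syllable block, which is the
"inefficient" part; the only data needed is whatever already identifies the
syllables in the grid.
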

> >Worst-case, adding Korean and Traditional Chinese tables will roughly
> >double the size of iconv.o to around 150k. This will noticeably enlarge
> >libc.so, but will make no difference to static-linked programs except
> >those using iconv. I'm hoping we can make these additions less
> >expensive, but I don't see a good way yet.
> 
> For static linking, can we have conditional linking like Qt does?

My feeling is that it's a tradeoff, and probably has more cons than
pros. Unlike Qt, musl's iconv is extremely small. Even with all the
above, the size of iconv.o will be under 130k, maybe closer to 110k.
If you actually use iconv in your program, this is a small price to
pay for having it fully functional. On the other hand, if linking it
is conditional, you have to consider who makes the decision, and when.
If it's at link time for each application, that's probably too much of
a musl-specific mechanism. If it's at build time for musl, then is it
your device vendor deciding for you what languages you need? One of
the biggest headaches of uClibc-based systems is finding that the
system libc was built with important options you need turned off and
that you need to hack in a replacement to get something working...

I think the cost of getting stuck with broken binaries where charsets
were omitted is sufficiently greater than the cost of adding a few
tens of kb to static binaries using iconv, that we should only
consider a build time option if embedded users are actively reporting
size problems.

Rich

