Re: iconv Korean and Traditional Chinese research so far

mailing list of musl libc
 help / color / mirror / code / Atom feed

From: Harald Becker <ralda@gmx.de>
Cc: musl@lists.openwall.com, dalias@aerifal.cx
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 02:46:51 +0200	[thread overview]
Message-ID: <20130805024651.6d7e1b7e@ralda.gmx.de> (raw)
In-Reply-To: <20130804165152.GA32076@brightrain.aerifal.cx>

Hi Rich,

in addition to my previous message to clarify some things:

04-08-2013 12:51 Rich Felker <dalias@aerifal.cx>:

> Worst-case, adding Korean and Traditional Chinese tables will
> roughly double the size of iconv.o to around 150k. This will
> noticably enlarge libc.so, but will make no difference to
> static-linked programs except those using iconv. I'm hoping we
> can make these additions less expensive, but I don't see a good
> way yet.

I would write iconv as a virtual machine interpreter for a very
simple byte code machine. The byte code (program) of the virtual
machine is just an array of unsigned bytes and the virtual
machine only contains the instructions to read next byte and
assemble a Unicode value or to receive a Unicode value and to
produce multi byte character output. The virtual machine code
itself works like a finite state machine to handle multi byte
character sets.

That way iconv consist of a small byte code interpreter to build
the virtual machine. Then it maps the byte code from an external
file for any required character set. This byte code from external
file consist of virtual machine instructions and conversion
tables. As this virtual machine shall be optimized for the
conversion purposes, conversion operations require only
interpretation of a view virtual instructions per converted
character (for simple character sets, big ones may need a few
more instructions). This operation is usually very fast, as not
much data is involved and instructions are highly optimized for
conversion operation.

The virtual machine works with a data space of only a few bytes
(less than 256), where some bytes need to preserve from one
conversion call to next. That is conversion needs a conversion
context of a few bytes (8..16). 

Independently from any character set conversion you want to add,
you only need a single byte code interpreter for iconv, which
will not increase in size. Only the external byte code /
conversion table for the charsets may vary in size. Simple char
sets, like Latins, consist of only a few bytes of byte code, big
charsets like Japanese, Chinese and Korean, need some more byte
code and may be some bigger translation tables ... but those
tables are only loaded if iconv needs to access such a charset.

iconv itself doesn't need to handle table of available charsets,
it only converts the charset name into a filename and opens the
corresponding charset translation file. On the charset file some
header and version check shall handle possible installation
conflicts. For any conversion request the virtual machine
interpreter runs through the byte code of the requested charset
and returns the conversion result. As the virtual machine shall
not contain operations to violate the remainder of the system,
this shall not break system security. At most the byte code is so
misbehaved that it runs forever, without producing an error or
any output. So the machine hangs just in an infinite loop during
conversion, until the process is terminated (a simple counter may
limit number of executed instructions and bail out in case of such
looping).

> At some point, especially if the cost is not reduced, I will
> probably add build-time options to exclude a configurable
> subset of the supported character encodings. This would not be
> extremely fine-grained, and the choices to exclude would
> probably be just: Japanese, Simplified Chinese, Traditional
> Chinese, and Korean. Legacy 8-bit might also be an option but
> these are so small I can't think of cases where it would be
> beneficial to omit them (5k for the tables on top of the 2k of
> actual code in iconv). Perhaps if there are cases where iconv
> is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices,
> dropping the 8-bit tables and all of the support code could be
> useful; the resulting iconv would be around 1k, I think.

You may skip all this, if iconv is constructed as a virtual
machine interpreter and all character conversions are loaded from
an external file.

As a fallback the library may compile in the byte code for some
small charset conversions, like ASCII, Latin-1, UTF-8. All other
charset conversions are loaded from external resources, which may
be installed or not depending on admins decision. And just
added if required later.  

--
Harald

next prev parent reply	other threads:[~2013-08-05  0:46 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-04 16:51 Rich Felker
2013-08-04 22:39 ` Harald Becker
2013-08-05  0:44   ` Szabolcs Nagy
2013-08-05  1:24     ` Harald Becker
2013-08-05  3:13       ` Szabolcs Nagy
2013-08-05  7:03         ` Harald Becker
2013-08-05 12:54           ` Rich Felker
2013-08-05  0:49   ` Rich Felker
2013-08-05  1:53     ` Harald Becker
2013-08-05  3:39       ` Rich Felker
2013-08-05  7:53         ` Harald Becker
2013-08-05  8:24           ` Justin Cormack
2013-08-05 14:43             ` Rich Felker
2013-08-05 14:35           ` Rich Felker
2013-08-05  0:46 ` Harald Becker [this message]
2013-08-05  5:00 ` Rich Felker
2013-08-05  8:28 ` Roy
2013-08-05 15:43   ` Rich Felker
2013-08-05 17:31     ` Rich Felker
2013-08-05 19:12   ` Rich Felker
2013-08-06  6:14     ` Roy
2013-08-06 13:32       ` Rich Felker
2013-08-06 15:11         ` Roy
2013-08-06 16:22           ` Rich Felker
2013-08-07  0:54             ` Roy
2013-08-07  7:20               ` Roy
     [not found] <20130804232816.dc30d64f61e5ec441c34ffd4f788e58e.313eb9eea8.wbe@email22.secureserver.net>
2013-08-05 12:46 ` Rich Felker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130805024651.6d7e1b7e@ralda.gmx.de \
    --to=ralda@gmx.de \
    --cc=dalias@aerifal.cx \
    --cc=musl@lists.openwall.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).