From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3815
Path: news.gmane.org!not-for-mail
From: Harald Becker <ralda@gmx.de>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 02:46:51 +0200
Message-ID: <20130805024651.6d7e1b7e@ralda.gmx.de>
References: <20130804165152.GA32076@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Trace: ger.gmane.org 1375663624 15488 80.91.229.3 (5 Aug 2013 00:47:04 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 5 Aug 2013 00:47:04 +0000 (UTC)
Cc: musl@lists.openwall.com, dalias@aerifal.cx
Original-X-From: musl-return-3819-gllmg-musl=m.gmane.org@lists.openwall.com Mon Aug 05 02:47:08 2013
Return-path: <musl-return-3819-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-3819-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1V68wz-0007Sm-NH
	for gllmg-musl@plane.gmane.org; Mon, 05 Aug 2013 02:47:05 +0200
Original-Received: (qmail 24574 invoked by uid 550); 5 Aug 2013 00:47:05 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 24563 invoked from network); 5 Aug 2013 00:47:05 -0000
In-Reply-To: <20130804165152.GA32076@brightrain.aerifal.cx>
X-Provags-ID: V03:K0:ZhnCjEHI8m1K7ePW4VyC7zSVuiRQI/18ldyDJylqa5R6NCge8a6
 RYi4/eK9ckFWWqN0fRfbDjBBMVQuHFU53mZcjhSgZLbT5OkAWC/pQjD0R+mbi2srwNoVuVx
 UNRIz27bhiolpFRGr+OufPGKvcyauym0Qi+5ruk/kKe+qFit84dR85iUOLOpzhU1MtC913z
 apl83UDKaaAVt1g3U3K5A==
Xref: news.gmane.org gmane.linux.lib.musl.general:3815
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/3815>

Hi Rich,

in addition to my previous message to clarify some things:

04-08-2013 12:51 Rich Felker <dalias@aerifal.cx>:

> Worst-case, adding Korean and Traditional Chinese tables will
> roughly double the size of iconv.o to around 150k. This will
> noticably enlarge libc.so, but will make no difference to
> static-linked programs except those using iconv. I'm hoping we
> can make these additions less expensive, but I don't see a good
> way yet.

I would write iconv as a virtual machine interpreter for a very
simple byte code machine. The byte code (program) of the virtual
machine is just an array of unsigned bytes and the virtual
machine only contains the instructions to read next byte and
assemble a Unicode value or to receive a Unicode value and to
produce multi byte character output. The virtual machine code
itself works like a finite state machine to handle multi byte
character sets.

That way iconv consist of a small byte code interpreter to build
the virtual machine. Then it maps the byte code from an external
file for any required character set. This byte code from external
file consist of virtual machine instructions and conversion
tables. As this virtual machine shall be optimized for the
conversion purposes, conversion operations require only
interpretation of a view virtual instructions per converted
character (for simple character sets, big ones may need a few
more instructions). This operation is usually very fast, as not
much data is involved and instructions are highly optimized for
conversion operation.

The virtual machine works with a data space of only a few bytes
(less than 256), where some bytes need to preserve from one
conversion call to next. That is conversion needs a conversion
context of a few bytes (8..16). 

Independently from any character set conversion you want to add,
you only need a single byte code interpreter for iconv, which
will not increase in size. Only the external byte code /
conversion table for the charsets may vary in size. Simple char
sets, like Latins, consist of only a few bytes of byte code, big
charsets like Japanese, Chinese and Korean, need some more byte
code and may be some bigger translation tables ... but those
tables are only loaded if iconv needs to access such a charset.

iconv itself doesn't need to handle table of available charsets,
it only converts the charset name into a filename and opens the
corresponding charset translation file. On the charset file some
header and version check shall handle possible installation
conflicts. For any conversion request the virtual machine
interpreter runs through the byte code of the requested charset
and returns the conversion result. As the virtual machine shall
not contain operations to violate the remainder of the system,
this shall not break system security. At most the byte code is so
misbehaved that it runs forever, without producing an error or
any output. So the machine hangs just in an infinite loop during
conversion, until the process is terminated (a simple counter may
limit number of executed instructions and bail out in case of such
looping).

> At some point, especially if the cost is not reduced, I will
> probably add build-time options to exclude a configurable
> subset of the supported character encodings. This would not be
> extremely fine-grained, and the choices to exclude would
> probably be just: Japanese, Simplified Chinese, Traditional
> Chinese, and Korean. Legacy 8-bit might also be an option but
> these are so small I can't think of cases where it would be
> beneficial to omit them (5k for the tables on top of the 2k of
> actual code in iconv). Perhaps if there are cases where iconv
> is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices,
> dropping the 8-bit tables and all of the support code could be
> useful; the resulting iconv would be around 1k, I think.

You may skip all this, if iconv is constructed as a virtual
machine interpreter and all character conversions are loaded from
an external file.

As a fallback the library may compile in the byte code for some
small charset conversions, like ASCII, Latin-1, UTF-8. All other
charset conversions are loaded from external resources, which may
be installed or not depending on admins decision. And just
added if required later.  

--
Harald