From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3819
Path: news.gmane.org!not-for-mail
From: Harald Becker <ralda@gmx.de>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 03:24:52 +0200
Message-ID: <20130805032452.280127fd@ralda.gmx.de>
References: <20130804165152.GA32076@brightrain.aerifal.cx>
	<20130805003943.050fc58e@ralda.gmx.de>
	<20130805004420.GL25714@port70.net>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Trace: ger.gmane.org 1375665904 2227 80.91.229.3 (5 Aug 2013 01:25:04 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Mon, 5 Aug 2013 01:25:04 +0000 (UTC)
Cc: musl@lists.openwall.com, nsz@port70.net
Original-X-From: musl-return-3823-gllmg-musl=m.gmane.org@lists.openwall.com Mon Aug 05 03:25:06 2013
Return-path: <musl-return-3823-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-3823-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1V69Xl-0001qr-RR
	for gllmg-musl@plane.gmane.org; Mon, 05 Aug 2013 03:25:05 +0200
Original-Received: (qmail 9741 invoked by uid 550); 5 Aug 2013 01:25:05 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 9733 invoked from network); 5 Aug 2013 01:25:04 -0000
In-Reply-To: <20130805004420.GL25714@port70.net>
X-Provags-ID: V03:K0:xeHtxQeWbniyELHAAkp+rZtt2Xu7gx7Lxf6O1rSwobVSJojDbrK
 GBRktOw1J6jb9mIzoQh2maff8rOmO1DG6fVaF8dhDplP/XAHi61KdlOmGznrH9pBnK2ZJ//
 fCm4Wgut0Cer2MoAFrQkz2BqEYglzXgL1Ab5fx8guabsdpbOvPJ6glzSYiQ7nSuF1aawrv7
 x6Xdx34MwT9xdSQuIrkqw==
Xref: news.gmane.org gmane.linux.lib.musl.general:3819
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/3819>

Hi!

05-08-2013 02:44 Szabolcs Nagy <nsz@port70.net>:

> * Harald Becker <ralda@gmx.de> [2013-08-05 00:39:43 +0200]:
> > Why cant we have all this character conversions on a state
> > driven machine which loads its information from a external
> > configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
> 
> external files provided by libc can work but they
> should be possible to embed into the binary

As far as I know, glibc creates small dynamically linked
objects and loads them when required. These are architecture
specific, so you always need conversion files which correspond
to your C library.

My intention is to write each conversion as machine independent
byte code, which may be copied between machines of different
architecture. If you need a charset conversion, just add the
charset byte code to the conversion directory, which may be
configurable (directory name from an environment variable, with
a default fallback). It may even be a search path for conversion
files, so conversion files may be installed in different
locations.
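
A minimal sketch of the search path idea, to make it concrete.
The variable name ICONV_CHARSET_PATH, the default directory and
both function names are only assumptions for illustration,
nothing of this exists in musl:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical default directory, used when the variable is unset. */
#define CS_DEFAULT_PATH "/usr/share/iconv"

/* Return the colon separated search path for conversion files.
 * ICONV_CHARSET_PATH is an assumed name, not an existing variable. */
const char *cs_search_path(void)
{
    const char *p = getenv("ICONV_CHARSET_PATH");
    return (p && *p) ? p : CS_DEFAULT_PATH;
}

/* Build the file name for charset `name` in the idx-th non-empty
 * directory of `path`.
 * Returns 1 on success, 0 if the path has no such directory,
 * -1 if the name contains '/' or buf is too small. */
int cs_candidate(const char *path, size_t idx, const char *name,
                 char *buf, size_t buflen)
{
    if (strchr(name, '/')) return -1;   /* do not leave the directory */
    while (*path) {
        size_t dlen = strcspn(path, ":");
        if (dlen > 0 && idx-- == 0) {
            int n = snprintf(buf, buflen, "%.*s/%s",
                             (int)dlen, path, name);
            return (n >= 0 && (size_t)n < buflen) ? 1 : -1;
        }
        path += dlen;
        if (*path == ':') path++;       /* skip the separator */
    }
    return 0;
}
```

A caller would loop over idx = 0, 1, 2, ... and try to open each
candidate until cs_candidate() returns 0.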

> otherwise a static binary is not self-contained
> and you have to move parts of the libc around
> along with the binary and if they are loaded
> from fixed path then it does not work at all
> (permissions, conflicting versions etc)

Ok, I see the static linking issue, but this is no problem with
byte code conversion programs. It can easily be added: Just
concatenate all the conversion byte code programs into a single
big array, with a name and offset table ahead, then link it into
your program.

This may be done in two steps:

1) Create a selection file for the musl build, and include the
specified charsets in libc.a/.so

2) Select the required charset files and create an .o file to
link into your program.


iconv then shall:
- look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
- search the table of charsets linked into libc
- search the table of charsets linked into the program
- search for the charset on the external search path

... or go in the opposite direction, and use the first charset
conversion found.

This lookup is usually very cheap, except for the file system
search, so it shall not produce much overhead / bloat.
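
The lookup order could be sketched as a chain of stages, where
the first stage that knows the charset wins. All names here are
hypothetical, and the four stages are only stubs which return
one-byte placeholder "programs" so that they can be told apart:

```c
#include <stddef.h>
#include <string.h>

/* A lookup stage returns the conversion program for `name`, or NULL. */
typedef const unsigned char *(*cs_stage)(const char *name);

/* Stage 1 (stub): charsets hard-wired into iconv itself. */
static const unsigned char *stage_fixed(const char *name)
{
    static const unsigned char ascii[] = {1}, latin1[] = {2}, utf8[] = {3};
    if (!strcmp(name, "ASCII")) return ascii;
    if (!strcmp(name, "ISO-8859-1")) return latin1;
    if (!strcmp(name, "UTF-8")) return utf8;
    return NULL;
}

/* Stage 2 (stub): table linked into libc; pretend it holds EUC-KR. */
static const unsigned char *stage_libc(const char *name)
{
    static const unsigned char euckr[] = {4};
    return strcmp(name, "EUC-KR") ? NULL : euckr;
}

/* Stage 3 (stub): table linked into the program; pretend it holds BIG5. */
static const unsigned char *stage_program(const char *name)
{
    static const unsigned char big5[] = {5};
    return strcmp(name, "BIG5") ? NULL : big5;
}

/* Stage 4 (stub): external search path; nothing is installed here. */
static const unsigned char *stage_path(const char *name)
{
    (void)name;
    return NULL;
}

/* The order from the list above; reverse the array to search in the
 * opposite direction. */
const cs_stage cs_order[] = {
    stage_fixed, stage_libc, stage_program, stage_path
};
const size_t cs_order_len = sizeof cs_order / sizeof cs_order[0];

/* Ask each stage in turn and return the first conversion found. */
const unsigned char *cs_resolve(const cs_stage *stages, size_t n,
                                const char *name)
{
    for (size_t i = 0; i < n; i++) {
        const unsigned char *prog = stages[i](name);
        if (prog) return prog;
    }
    return NULL;
}
```

With this shape the search direction is only a question of how
the array is ordered, the resolver itself stays the same.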

[Addendum after thinking a bit more: The byte code conversion
files shall consist of a small static header, followed by the
byte code program. The header shall contain the charset name,
the version of the required virtual machine and the length of
the byte code. So you only need to append all such conversion
files to one big array of bytes and add a null header to mark
the end of the table. Then you only need the start of the array
to be able to search through it for a specific charset. The
iconv function in libc contains a definition for an "unsigned
char const *iconv_user_charsets = NULL;", which is linked in
when the user does not provide their own definition. So iconv
can search all linked in charset definitions, and needs no code
changes. Really simple configuration to select the charsets to
build in.]
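
A rough sketch of how such a table could be walked. The message
above only says which fields the header has, so the concrete
layout here (24 byte name, 2 byte version, 4 byte length, all
big-endian) and the name cs_find are pure assumptions:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed header layout, read byte by byte so that it does not
 * depend on the architecture:
 *   bytes  0..23  charset name, NUL padded
 *   bytes 24..25  required virtual machine version, big-endian
 *   bytes 26..29  length of the byte code, big-endian
 * A header whose name starts with NUL marks the end of the table. */
enum { CS_NAME_LEN = 24, CS_VER_OFF = 24, CS_LEN_OFF = 26, CS_HDR_LEN = 30 };

/* Read an n byte big-endian unsigned value. */
static uint32_t rd_be(const unsigned char *p, int n)
{
    uint32_t v = 0;
    while (n-- > 0) v = v << 8 | *p++;
    return v;
}

/* Search the concatenated conversion files starting at `tab` for
 * charset `name`. On success return a pointer to its byte code and
 * store the VM version and code length; otherwise return NULL.
 * The table is trusted: there are no bounds checks in this sketch. */
const unsigned char *cs_find(const unsigned char *tab, const char *name,
                             unsigned *vm_version, size_t *code_len)
{
    size_t nlen = strlen(name);
    if (!tab || nlen == 0 || nlen >= CS_NAME_LEN) return NULL;
    while (tab[0] != 0) {
        size_t len = rd_be(tab + CS_LEN_OFF, 4);
        /* Compare including the terminating NUL, so "BIG" != "BIG5". */
        if (memcmp(tab, name, nlen + 1) == 0) {
            *vm_version = rd_be(tab + CS_VER_OFF, 2);
            *code_len = len;
            return tab + CS_HDR_LEN;
        }
        tab += CS_HDR_LEN + len;   /* skip header and byte code */
    }
    return NULL;
}
```

The same function would serve for the table linked into libc and
for a table reached through iconv_user_charsets, as the NULL
check at the top covers the case where no user table is linked.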

> if the format changes then dynamic linking is
> problematic as well: you cannot update libc
> in a single atomic operation

The byte code shall be independent of dynamic linking. The
conversion files are only streams of bytes, which shall also be
architecture independent. So you only need to update the
conversion files if the virtual machine definition of iconv has
changed (which shall not happen often). External files may be
read into malloc-ed buffers or mmap-ed, not linked in by the
dynamic linker.
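
For the malloc-ed variant, a small sketch using only standard C
stdio. The function name cs_load is hypothetical; an mmap based
version would map the file read-only instead of copying it:

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole conversion file into a malloc-ed buffer.
 * Returns the buffer (to be released with free()) and stores its
 * length in *len, or returns NULL on any error. */
unsigned char *cs_load(const char *file, size_t *len)
{
    FILE *f = fopen(file, "rb");
    if (!f) return NULL;

    unsigned char *buf = NULL;
    size_t cap = 0, n = 0;
    for (;;) {
        if (n == cap) {                     /* buffer full: grow it */
            size_t ncap = cap ? cap * 2 : 4096;
            unsigned char *nb = realloc(buf, ncap);
            if (!nb) { free(buf); fclose(f); return NULL; }
            buf = nb;
            cap = ncap;
        }
        size_t r = fread(buf + n, 1, cap - n, f);
        n += r;
        if (r == 0) break;                  /* end of file or error */
    }
    if (ferror(f)) { free(buf); fclose(f); return NULL; }
    fclose(f);
    *len = n;
    return buf;
}
```

The buffer is just a stream of bytes, so nothing in it needs to
be relocated or resolved by the dynamic linker.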

--
Harald