From: Harald Becker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 03:24:52 +0200
Message-ID: <20130805032452.280127fd@ralda.gmx.de>
References: <20130804165152.GA32076@brightrain.aerifal.cx> <20130805003943.050fc58e@ralda.gmx.de> <20130805004420.GL25714@port70.net>
In-Reply-To: <20130805004420.GL25714@port70.net>
Reply-To: musl@lists.openwall.com
Cc: musl@lists.openwall.com, nsz@port70.net

Hi !

05-08-2013 02:44 Szabolcs Nagy:

> * Harald Becker [2013-08-05 00:39:43 +0200]:
> > Why cant we have all this character conversions on a state
> > driven machine which loads its information from a external
> > configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
>
> external files provided by libc can work but they
> should be possible to embed into the binary

As far as I know, glibc creates small dynamically linked objects
and loads them when required. These are architecture specific, so
you always need conversion files that correspond to your C library.

My intention is to write the conversions as machine-independent
byte code, which can be copied between machines of different
architectures. If you need a charset conversion, just add the
charset byte code to the conversion directory, which may be
configurable (directory name taken from an environment variable,
with a default fallback). There may even be a search path for
conversion files, so that they can be installed in different
locations.

> otherwise a static binary is not self-contained
> and you have to move parts of the libc around
> along with the binary and if they are loaded
> from fixed path then it does not work at all
> (permissions, conflicting versions etc)

OK, I see the static linking issue, but this is no problem with
byte code conversion programs. It can easily be handled: just
concatenate all the conversion byte code programs into a single
big array, with a name and offset table ahead of it, then link it
into your program. This may be done in two steps:

1) Create a selection file for the musl build, and include the
   specified charsets in libc.a/.so

2) Select the required charset files and create an .o file to
   link into your program.

iconv then shall:

- look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
- search the table of charsets linked with libc
- search the table of charsets linked with the program
- search for the charset on the external search path

... or go in the opposite direction and use the first charset
conversion found. This lookup is usually very small, except for
the file system search, so it shall not produce much overhead /
bloat.

[Addendum after thinking a bit more: The byte code conversion
files shall consist of a small static header, followed by the
byte code program. The header shall contain the charset name, the
version of the required virtual machine and the length of the
byte code. So you only need to add all such conversion files to a
big array of bytes and append a null header to mark the end of
the table. Then you only need the start of the array to be able
to search through it for a specific charset. The iconv function
in libc contains a definition
"unsigned char const *iconv_user_charsets = NULL;", which is
linked in when the user does not provide their own definition. So
iconv can search all linked-in charset definitions and needs no
code changes. Really simple configuration to select which
charsets to build in.]

> if the format changes then dynamic linking is
> problematic as well: you cannot update libc
> in a single atomic operation

The byte code shall be independent of dynamic linking. The
conversion files are only streams of bytes, which shall also be
architecture independent. So you only need to update the
conversion files if the virtual machine definition of iconv has
changed (which shall not happen often). External files may be
read into malloc-ed buffers or mmap-ed, not linked in by the
dynamic linker.

--
Harald
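
The table walk described in the addendum can be sketched roughly
as below. This is only an illustration under assumptions the mail
does not specify: a 16-byte NUL-padded name field, a one-byte
virtual machine version, and a two-byte big-endian length (chosen
so the bytes read the same on every architecture). The names
cs_entry, cs_find and demo_table are hypothetical and not part of
musl.

```c
#include <stddef.h>
#include <string.h>

/* Assumed header layout (hypothetical, not defined in the mail):
 *   16 bytes  charset name, NUL-padded
 *    1 byte   required virtual machine version
 *    2 bytes  byte code length, big-endian
 * followed directly by the byte code itself. */
enum { CS_NAME_LEN = 16, CS_HDR_LEN = CS_NAME_LEN + 3 };

struct cs_entry {
    const unsigned char *code; /* first byte of the byte code */
    size_t len;                /* byte code length */
    unsigned version;          /* required VM version */
};

/* Walk a concatenated charset table. Returns 1 and fills *out when
 * name is found, 0 otherwise. A header whose name starts with NUL
 * (the "null header") terminates the table. */
static int cs_find(const unsigned char *tab, const char *name,
                   struct cs_entry *out)
{
    size_t n = strlen(name);

    if (!tab || n == 0 || n >= CS_NAME_LEN)
        return 0;
    while (tab[0] != 0) {
        size_t len = (size_t)tab[CS_NAME_LEN + 1] << 8
                   | tab[CS_NAME_LEN + 2];
        /* exact match: same bytes and the stored name ends here */
        if (memcmp(tab, name, n) == 0 && tab[n] == 0) {
            out->code = tab + CS_HDR_LEN;
            out->len = len;
            out->version = tab[CS_NAME_LEN];
            return 1;
        }
        tab += CS_HDR_LEN + len; /* skip header and byte code */
    }
    return 0;
}

/* Two dummy entries plus the terminating null header. The payload
 * bytes are placeholders, not real conversion byte code. */
static const unsigned char demo_table[] = {
    'K','O','I','8','-','R',0,0,0,0,0,0,0,0,0,0, 1, 0,2, 0xAA,0xBB,
    'B','I','G','5',0,0,0,0,0,0,0,0,0,0,0,0,     1, 0,1, 0xCC,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,             0, 0,0
};
```

With such a helper, iconv could call cs_find() first on the table
linked into libc and then on iconv_user_charsets. A NULL pointer
is simply treated as an empty table, so the default definition
costs nothing.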