mailing list of musl libc
From: Harald Becker <ralda@gmx.de>
Cc: musl@lists.openwall.com, dalias@aerifal.cx
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 03:53:12 +0200
Message-ID: <20130805035312.5d874012@ralda.gmx.de>
In-Reply-To: <20130805004915.GA221@brightrain.aerifal.cx>

Hi Rich !

04-08-2013 20:49 Rich Felker <dalias@aerifal.cx>:

> Do I want to add that size? No, of course not, and that's why
> I'm hoping (but not optimistic) that there may be a way to
> elide a good part of the table based on patterns in the Hangul
> syllables or the possibility that the giant extensions are
> unimportant.

I think there is an easy way to make this configurable. See my
other mails; they clarify what I intend.

> Do I want to give users who have large volumes of legacy text
> in their languages stored in these encodings the same respect
> and dignity as users of other legacy encodings we already
> support? Yes.

Of course. I won't dictate to others which conversions they want
to use. I just hate having plenty of conversion tables on my
system when I know I will never use those conversions ... but in
case I really need one, it can be added dynamically to the
running system.


> > Why cant we have all this character conversions on a state
> > driven machine which loads its information from a external
> > configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
> 
> This issue was discussed a long time ago and the consensus
> among users of static linking was that static linking is most
> valuable when it makes the binary completely "portable" to
> arbitrary Linux systems for the same cpu arch, without any
> dependency on having files in particular locations on the
> system aside from the minimum required by POSIX (things
> like /dev/null), the standard Linux /proc mountpoint, and
> universal config files like /etc/resolv.conf (even that is not
> necessary, BTW, if you have a DNS on localhost). Having iconv
> not work without external character tables is essentially a
> form of dynamic linking, and carries with it issues like where
> the files are to be found (you can override that with an
> environment variable, but that can't be permitted for setuid
> binaries), what happens if the format needs to change and the
> format on the target machine is not compatible with the libc
> version your binary was built with, etc. This is also the main
> reason musl does not support something like nss.

I see the point of self-contained linking, and you are right that
it is required, but it is fully possible to have the best of both
worlds without much overhead. Writing iconv as a virtual machine
interpreter allows the conversion byte code programs to be linked
in statically. Those that are not linked in can be searched for
in the filesystem, and a simple configuration option may disable
the filesystem search completely for really small embedded
systems. Apart from that, all conversions have the same format
and may be freely copied between architectures, or linked
statically into a user program (just put the byte stream of the
selected charsets into a simple C array of bytes).
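
A minimal sketch of that lookup order, to make the idea concrete.
Everything here is hypothetical and not taken from musl: the names
conv_prog, find_conv, load_from_fs and the CONV_NO_FS_SEARCH switch
are made up for illustration, the byte code is a placeholder, and
the filesystem loader is only a stub.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one conversion byte code program. */
struct conv_prog {
    const char *name;           /* charset name */
    const unsigned char *code;  /* byte code, a plain array of bytes */
    size_t len;                 /* length of the byte code */
};

/* Placeholder byte code; a real program would be generated offline. */
static const unsigned char latin1_code[] = { 0x01, 0x00 };

/* Conversions linked statically into the binary. */
static const struct conv_prog builtin[] = {
    { "LATIN1", latin1_code, sizeof latin1_code },
};

#ifndef CONV_NO_FS_SEARCH
/* Hypothetical filesystem fallback, stubbed out in this sketch. */
static const struct conv_prog *load_from_fs(const char *name)
{
    (void)name;
    return NULL; /* a real version would look for a byte code file */
}
#endif

/* Look among the built-in programs first, then (optionally) on disk. */
const struct conv_prog *find_conv(const char *name)
{
    for (size_t i = 0; i < sizeof builtin / sizeof builtin[0]; i++)
        if (!strcmp(builtin[i].name, name))
            return &builtin[i];
#ifndef CONV_NO_FS_SEARCH
    return load_from_fs(name);
#else
    return NULL;
#endif
}
```

Building with -DCONV_NO_FS_SEARCH would drop the filesystem path
entirely, which is the "small embedded" configuration described
above.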

> Another side benefit of the current implementation is that it's
> fully self-contained and independent of any system facilities.
> It's pure C and can be taken out of musl and dropped in to any
> program on any C implementation, including freestanding
> (non-hosted) implementations. If it depended on the filesystem,
> adapting it for such usage would be a lot more work.

The virtual machine would be written in C; I've done that type of
programming many times. So the resulting code will compile with
any C compiler, and byte code programs are just arrays of bytes,
independent of machine byte order. So you will not have any
further dependencies.
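
To illustrate the byte order point, here is a rough sketch of a
toy interpreter. The opcode set (OP_END, OP_RANGE), the operand
layout and the function names are assumptions invented for this
example, not a proposed design. Multi-byte operands are assembled
from single bytes, so the same program array gives the same result
on little-endian and big-endian hosts. The demo program maps ASCII
to itself and, as in ISO-8859-5, bytes 0xB0..0xEF to U+0410..U+044F.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy conversion VM. */
enum { OP_END = 0, OP_RANGE = 1 };

/* Read a 16-bit operand stored high byte first, whatever the host order. */
static unsigned rd16(const unsigned char *p)
{
    return (unsigned)p[0] << 8 | p[1];
}

/*
 * Instruction layout (5 bytes): OP_RANGE lo hi base_hi base_lo
 * Input bytes lo..hi map to code points base, base+1, ...
 * Returns the code point, or -1 if no instruction matches.
 */
long conv_byte(const unsigned char *prog, unsigned char c)
{
    for (;;) {
        switch (*prog) {
        case OP_RANGE:
            if (c >= prog[1] && c <= prog[2])
                return (long)rd16(prog + 3) + (c - prog[1]);
            prog += 5; /* skip to the next instruction */
            break;
        default: /* OP_END or unknown opcode: stop */
            return -1;
        }
    }
}

/* Demo program: just an array of bytes, usable on any architecture. */
static const unsigned char demo_prog[] = {
    OP_RANGE, 0x00, 0x7F, 0x00, 0x00, /* ASCII maps to itself */
    OP_RANGE, 0xB0, 0xEF, 0x04, 0x10, /* 0xB0..0xEF -> U+0410..U+044F */
    OP_END
};
```

Adding another charset in this scheme means adding another byte
array like demo_prog; the interpreter itself stays unchanged.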

> A fsm implementation would be several times larger than the
> implementations in iconv.c.

A bit larger, yes ... but not by much if the virtual machine is
designed carefully, and it will not grow as more charsets get
added (only the size of each byte code program is added).


> It's possible that we could, at some time in the future,
> support loading of user-defined character conversion files as
> an added feature, but this should only be for really
> special-purpose things like custom encodings used for games or
> obsolete systems (old Mac, console games, IBM mainframes, etc.).

We can have it all, without much overhead. And it is not only
for such special cases. I don't want to install musl on my
systems with Japanese, Chinese or Korean conversions, but in case
I really need them, I'm able to throw them in without much work.

... and we can add any character conversion on the fly, without
rebuilding the library.

> In terms of the criteria for what to include in musl itself, my
> idea is that if you have a mail client or web browser based on
> iconv for its character set handling, you should be able to
> read the bulk of content in any language.

That holds if you are building a mail client or web browser. But
what if you want to offer charset conversion and still stay
small, building in only the conversions relevant to the system
without being limited to those? Any other conversion can then be
added on the fly.

--
Harald


