From: Harald Becker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 03:53:12 +0200
Message-ID: <20130805035312.5d874012@ralda.gmx.de>
References: <20130804165152.GA32076@brightrain.aerifal.cx> <20130805003943.050fc58e@ralda.gmx.de> <20130805004915.GA221@brightrain.aerifal.cx>
In-Reply-To: <20130805004915.GA221@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
Cc: musl@lists.openwall.com, dalias@aerifal.cx

Hi Rich!

04-08-2013 20:49 Rich Felker:

> Do I want to add that size?
> No, of course not, and that's why
> I'm hoping (but not optimistic) that there may be a way to
> elide a good part of the table based on patterns in the Hangul
> syllables or the possibility that the giant extensions are
> unimportant.

I think there is a way to make this easily configurable. See my other
mails; they clarify what my intention is.

> Do I want to give users who have large volumes of legacy text
> in their languages stored in these encodings the same respect
> and dignity as users of other legacy encodings we already
> support?

Yes, of course. I won't dictate to others which conversions they may
use. I only hate having plenty of conversion tables on my system when
I know I will never use those conversions. ... but in case I really
need one, it can be added dynamically to the running system.

> > Why cant we have all this character conversions on a state
> > driven machine which loads its information from a external
> > configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
>
> This issue was discussed a long time ago and the consensus
> among users of static linking was that static linking is most
> valuable when it makes the binary completely "portable" to
> arbitrary Linux systems for the same cpu arch, without any
> dependency on having files in particular locations on the
> system aside from the minimum required by POSIX (things
> like /dev/null), the standard Linux /proc mountpoint, and
> universal config files like /etc/resolv.conf (even that is not
> necessary, BTW, if you have a DNS on localhost).
> Having iconv
> not work without external character tables is essentially a
> form of dynamic linking, and carries with it issues like where
> the files are to be found (you can override that with an
> environment variable, but that can't be permitted for setuid
> binaries), what happens if the format needs to change and the
> format on the target machine is not compatible with the libc
> version your binary was built with, etc. This is also the main
> reason musl does not support something like nss.

I see the point about self-contained linking, and you are right that
it is required, but it is entirely possible to have the best of both
worlds without much overhead. Writing iconv as a virtual machine
interpreter allows the conversion byte code programs to be linked in
statically. Those that are not linked in can be searched for in the
filesystem, and a simple configuration option may disable the
filesystem search completely for really small embedded systems. Apart
from this, all conversions are the same and may be freely copied
between architectures, or linked statically into a user program (just
put the byte stream of the selected charsets into a simple C array of
bytes).

> Another side benefit of the current implementation is that it's
> fully self-contained and independent of any system facilities.
> It's pure C and can be taken out of musl and dropped in to any
> program on any C implementation, including freestanding
> (non-hosted) implementations. If it depended on the filesystem,
> adapting it for such usage would be a lot more work.

The virtual machine would be written in C; I have done this type of
programming many times. The resulting code will compile with any C
compiler, and the byte code programs are just arrays of bytes,
independent of machine byte order. So you will not have any further
dependencies.

> A fsm implementation would be several times larger than the
> implementations in iconv.c.

A bit larger, yes ...
but not by much if the virtual machine is designed carefully, and the
interpreter will not grow as more charsets are added (only the size of
each byte code program is added).

> It's possible that we could, at some time in the future,
> support loading of user-defined character conversion files as
> an added feature, but this should only be for really
> special-purpose things like custom encodings used for games or
> obsolete systems (old Mac, console games, IBM mainframes, etc.).

We can have it all without much overhead, and it is not only for such
special cases. I don't want to install musl on my systems with
Japanese, Chinese or Korean conversions, but in case I really need
them, I would be able to throw them in without much work. ... and we
can add any character conversion on the fly, without rebuilding the
library.

> In terms of the criteria for what to include in musl itself, my
> idea is that if you have a mail client or web browser based on
> iconv for its character set handling, you should be able to
> read the bulk of content in any language.

That is fine if you are building a mail client or web browser, but
what if you want to include charset conversion and still stay small,
building in only the conversions relevant to your system without being
limited to those? Any other conversion can then be added on the fly.

--
Harald
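
Illustration of the byte code idea described in this mail: a minimal
sketch in C, not an actual design for musl. The record format (one
opcode byte, a low and high input byte, and a 3-byte big-endian base
code point), the names OP_RANGE, vm_decode_byte and find_charset, and
the choice of ISO-8859-1 and ISO-8859-15 as sample charsets are all
assumptions made for this example. It only decodes single-byte
charsets to Unicode code points, and the filesystem search for
charsets that are not linked in is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical opcodes; every record is 6 bytes:
 * op, lo, hi, base[3] (big-endian), so the program is a plain byte
 * stream that does not depend on the host byte order. */
enum { OP_END = 0, OP_RANGE = 1 };

/* ISO-8859-1: every byte maps to the code point of the same value. */
static const unsigned char prog_8859_1[] = {
    OP_RANGE, 0x00, 0xFF, 0x00, 0x00, 0x00,
    OP_END
};

/* ISO-8859-15: eight positions differ from ISO-8859-1. The first
 * matching record wins, so exceptions precede the catch-all range. */
static const unsigned char prog_8859_15[] = {
    OP_RANGE, 0xA4, 0xA4, 0x00, 0x20, 0xAC, /* U+20AC */
    OP_RANGE, 0xA6, 0xA6, 0x00, 0x01, 0x60, /* U+0160 */
    OP_RANGE, 0xA8, 0xA8, 0x00, 0x01, 0x61, /* U+0161 */
    OP_RANGE, 0xB4, 0xB4, 0x00, 0x01, 0x7D, /* U+017D */
    OP_RANGE, 0xB8, 0xB8, 0x00, 0x01, 0x7E, /* U+017E */
    OP_RANGE, 0xBC, 0xBD, 0x00, 0x01, 0x52, /* U+0152, U+0153 */
    OP_RANGE, 0xBE, 0xBE, 0x00, 0x01, 0x78, /* U+0178 */
    OP_RANGE, 0x00, 0xFF, 0x00, 0x00, 0x00, /* everything else */
    OP_END
};

/* Charsets linked statically into this binary (hypothetical table). */
struct charset {
    const char *name;
    const unsigned char *prog;
};

static const struct charset builtin[] = {
    { "ISO-8859-1",  prog_8859_1  },
    { "ISO-8859-15", prog_8859_15 },
};

/* Look up a statically linked program by exact name. Returns NULL if
 * the charset is not built in; a fuller version could fall back to a
 * filesystem search at this point. */
static const unsigned char *find_charset(const char *name)
{
    for (size_t i = 0; i < sizeof builtin / sizeof builtin[0]; i++)
        if (strcmp(builtin[i].name, name) == 0)
            return builtin[i].prog;
    return NULL;
}

/* Interpret the program for one input byte. Returns the Unicode code
 * point, or -1 if no record matches or an unknown opcode is found. */
static long vm_decode_byte(const unsigned char *prog, unsigned char b)
{
    for (const unsigned char *p = prog; p[0] != OP_END; p += 6) {
        if (p[0] != OP_RANGE)
            return -1;
        if (b >= p[1] && b <= p[2]) {
            long base = ((long)p[3] << 16) | ((long)p[4] << 8) | p[5];
            return base + (b - p[1]);
        }
    }
    return -1;
}
```

In this sketch the interpreter stays the same size no matter how many
charsets are present; adding one means adding one byte array and one
table entry. For example, vm_decode_byte(find_charset("ISO-8859-15"),
0xA4) yields 0x20AC, while the same byte under "ISO-8859-1" yields
0xA4.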