* Re: iconv Korean and Traditional Chinese research so far
  [not found] <20130804232816.dc30d64f61e5ec441c34ffd4f788e58e.313eb9eea8.wbe@email22.secureserver.net>
@ 2013-08-05 12:46 ` Rich Felker
  0 siblings, 0 replies; 18+ messages in thread
From: Rich Felker @ 2013-08-05 12:46 UTC (permalink / raw)
To: musl

On Sun, Aug 04, 2013 at 11:28:16PM -0700, writeonce@midipix.org wrote:
> -------- Original Message --------
> Subject: Re: [musl] iconv Korean and Traditional Chinese research so far
> From: Rich Felker <[1]dalias@aerifal.cx>
> Date: Sun, August 04, 2013 10:00 pm
> To: [2]musl@lists.openwall.com
>
> > Being that HKSCS is a standard, registered MIME charset, that the cost
> > is only 10k, and that it seems necessary for real-world usage in Hong
> > Kong, I think it's pretty obvious that we should support it. So I
> > think the question we're left with is whether the CP949 (MS encoding)
> > extension for Korean is important to support. The cost is roughly 37k.
>
> In case that helps: Korean typesetting packages for
> TeX/LaTeX/XeLaTeX/LuaLaTeX do provide support for CP949. This includes
> the most recent versions; see, for instance:
> [3]http://osl.ugr.es/CTAN/macros/luatex/generic/luatexko/luatexko-uhc2utf8.lua
>
> There are also several packages that support UHC, which allegedly
> overlaps with CP949:
> [4]http://www.ctan.org/topic/korean
>
> Best regards,
> zg

Thanks. I've also seen a fair number of bug reports asking programs
that lack it to add support, so it seems there is at least some
demand.

I'm going to go ahead and look into whether there's a way we could do
slow but algorithmic conversion (e.g. are all the characters outside
the standard range allocated sequentially, based on the set missing
from the standard range?), since that would make the question of
whether or not it's needed fairly irrelevant.

Rich

^ permalink raw reply	[flat|nested] 18+ messages in thread
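For reference, the Unicode side of the Korean question really is algorithmic: the Hangul Syllables block (U+AC00..U+D7A3) is generated from jamo indices by a fixed formula, with 19 leading consonants, 21 vowels, and 28 trailing consonants (index 0 meaning "none"). A minimal sketch of both directions (the function names here are illustrative, not from musl):

```c
#include <assert.h>

/* Compose a Hangul syllable from jamo indices:
 * L in [0,19), V in [0,21), T in [0,28) with T == 0 meaning
 * no trailing consonant. The block starts at U+AC00. */
unsigned hangul_syllable(unsigned L, unsigned V, unsigned T)
{
	return 0xAC00 + (L * 21 + V) * 28 + T;
}

/* Decompose a syllable back into its constituent jamo indices. */
void hangul_jamo(unsigned s, unsigned *L, unsigned *V, unsigned *T)
{
	s -= 0xAC00;
	*T = s % 28;
	*V = (s / 28) % 21;
	*L = s / (28 * 21);
}
```

This is exactly why a purely algorithmic legacy conversion would be attractive: the Unicode half of the mapping costs no table space at all. The open question in the thread is only whether the legacy side has a comparable pattern.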
* iconv Korean and Traditional Chinese research so far
@ 2013-08-04 16:51 Rich Felker
  2013-08-04 22:39 ` Harald Becker
  ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Rich Felker @ 2013-08-04 16:51 UTC (permalink / raw)
To: musl

OK, so here's what I've found so far. Both legacy Korean and legacy
Traditional Chinese encodings have essentially a single base
character set:

Korean: KS X 1001 (previously known as KS C 5601)
  93 x 94 DBCS grid (lead bytes A1-FD, tail bytes A1-FE)
  All characters in the BMP
  17484 bytes of table space

Traditional Chinese: Big5 (CP950)
  89 x (63+94) DBCS grid (lead bytes A1-F9, tail bytes 40-7E,A1-FE)
  All characters in the BMP
  27946 bytes of table space

Both of these have various minor extensions, but the main extensions
of any relevance seem to be:

Korean: CP949
  Lead byte range is extended to 81-FD (125)
  Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
  44500 bytes of table space

Traditional Chinese: HKSCS (CP951)
  Lead byte range is extended to 88-FE (119)
  1651 characters outside the BMP
  37366 bytes of table space for the 16-bit mapping table, plus an
  extra mapping needed for the characters outside the BMP

The big remaining questions are:

1. How important are these extensions? I would guess the answer is
"fairly important", especially for HKSCS, where I believe the
additional characters are needed for encoding Cantonese words. It's
less clear to me whether the Korean extensions are useful (they seem
to mainly be for the sake of completeness, representing most/all
theoretically possible syllables that don't actually occur in words,
but this may be a naive misunderstanding on my part).

2. Are there patterns to exploit? For Korean, ALL of the Hangul
characters are actually combinations of several base letters.
Unicode encodes them all sequentially in a pattern where the
conversion to their constituent letters is purely algorithmic, but
there seems to be no clean pattern in the legacy encodings, as the
encodings started out just encoding the "important" ones, then added
less important combinations in separate ranges.

Worst-case, adding Korean and Traditional Chinese tables will roughly
double the size of iconv.o to around 150k. This will noticeably
enlarge libc.so, but will make no difference to static-linked
programs except those using iconv. I'm hoping we can make these
additions less expensive, but I don't see a good way yet.

At some point, especially if the cost is not reduced, I will probably
add build-time options to exclude a configurable subset of the
supported character encodings. This would not be extremely
fine-grained; the choices to exclude would probably be just Japanese,
Simplified Chinese, Traditional Chinese, and Korean. Legacy 8-bit
might also be an option, but those tables are so small (5k on top of
the 2k of actual code in iconv) that I can't think of cases where it
would be beneficial to omit them. Perhaps if iconv is needed purely
for conversion between different Unicode forms, but no legacy
charsets, on tiny embedded devices, dropping the 8-bit tables and all
of the support code could be useful; the resulting iconv would be
around 1k, I think.

Rich

^ permalink raw reply	[flat|nested] 18+ messages in thread
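The table-space figures above follow directly from the grid dimensions: a 16-bit mapping costs two bytes per (lead byte, tail byte) cell. A quick arithmetic check (helper name is mine, purely illustrative):

```c
#include <assert.h>

/* Size in bytes of a 16-bit mapping table covering a DBCS grid:
 * one 2-byte Unicode value per (lead, tail) cell. */
unsigned table_bytes(unsigned leads, unsigned tails)
{
	return leads * tails * 2;
}
```

Plugging in the ranges from the post: KS X 1001 is 93 leads x 94 tails, Big5 is 89 x (63+94), CP949 is 125 x (26+26+126), and HKSCS extends the leads to 119 over the Big5 tail range, which reproduces the 17484, 27946, 44500, and 37366 byte figures exactly.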
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 16:51 Rich Felker
@ 2013-08-04 22:39 ` Harald Becker
  2013-08-05  0:44 ` Szabolcs Nagy
  2013-08-05  0:49 ` Rich Felker
  2013-08-05  0:46 ` Harald Becker
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Harald Becker @ 2013-08-04 22:39 UTC (permalink / raw)
Cc: musl, dalias

Hi Rich!

> Worst-case, adding Korean and Traditional Chinese tables will
> roughly double the size of iconv.o to around 150k. This will
> noticeably enlarge libc.so, but will make no difference to
> static-linked programs except those using iconv. I'm hoping we
> can make these additions less expensive, but I don't see a good
> way yet.

Oh nooo, do you really want to add this statically to the iconv
version?

Why can't we have all these character conversions in a state-driven
machine which loads its information from an external configuration
file? This way we can have any kind of conversion someone likes, just
by adding the configuration file for the required Unicode-to-X and
X-to-Unicode conversions. State-driven FSM interpreters are really
small and fast and may read their complete configuration from a
file ... an architecture-independent file, so we could have the same
character conversion files for all architectures.

> At some point, especially if the cost is not reduced, I will
> probably add build-time options to exclude a configurable
> subset of the supported character encodings.

All this would go away if you did not load character conversions from
a static table. Why don't you consider loading a conversion file for
a given character set from a predefined or configurable directory,
with the name of the character set as the filename? If you want the
file to be in a directly readable/modifiable form, you need to add a
minimalistic parser; otherwise the file contents may be considered
binary data and you can just fread or mmap the file and use the data
to control the character set conversion.
Most conversions only need minimal space; only some require bigger
conversion routines.

... and for those who dislike this, you just don't install the
conversion files you do not want.

--
Harald

^ permalink raw reply	[flat|nested] 18+ messages in thread
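The data-driven converter Harald proposes can be sketched minimally for the easy case of an 8-bit charset, where the "configuration file" is nothing but a table of Unicode values for the high bytes. The struct and format below are hypothetical illustrations of the idea, not an actual musl or glibc interface:

```c
#include <assert.h>

/* Hypothetical data-only description of an 8-bit charset: the engine
 * never changes when charsets are added, only the data does. Bytes
 * 0x00-0x7F are assumed to be ASCII; hi[] maps 0x80-0xFF. */
struct charset8 {
	unsigned short hi[128];	/* Unicode for bytes 0x80..0xFF; 0 = unmapped */
};

/* Demo table mapping only byte 0xA4 to U+20AC (an invented example). */
static const struct charset8 demo_cs = { .hi = { [0x24] = 0x20AC } };

/* Convert one byte to a Unicode code point; -1 if unmapped. */
long conv8(const struct charset8 *cs, unsigned char b)
{
	if (b < 0x80) return b;			/* ASCII passes through */
	unsigned short u = cs->hi[b - 0x80];
	return u ? u : -1;
}
```

Such a table could indeed be fread or mmapped from a file as suggested; the hard part, which this sketch sidesteps, is the multi-byte CJK case, where the state machine and its tables are far larger.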
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 22:39 ` Harald Becker
@ 2013-08-05  0:44 ` Szabolcs Nagy
  2013-08-05  1:24 ` Harald Becker
  2013-08-05  0:49 ` Rich Felker
  1 sibling, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2013-08-05 0:44 UTC (permalink / raw)
To: musl

* Harald Becker <ralda@gmx.de> [2013-08-05 00:39:43 +0200]:
> Why can't we have all these character conversions in a state-driven
> machine which loads its information from an external configuration
> file? This way we can have any kind of conversion someone likes,
> just by adding the configuration file for the required Unicode-to-X
> and X-to-Unicode conversions.

external files provided by libc can work, but they should be
possible to embed into the binary; otherwise a static binary is not
self-contained and you have to move parts of the libc around along
with the binary, and if they are loaded from a fixed path then it
does not work at all (permissions, conflicting versions etc)

if the format changes then dynamic linking is problematic as well:
you cannot update libc in a single atomic operation

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  0:44 ` Szabolcs Nagy
@ 2013-08-05  1:24 ` Harald Becker
  2013-08-05  3:13 ` Szabolcs Nagy
  0 siblings, 1 reply; 18+ messages in thread
From: Harald Becker @ 2013-08-05 1:24 UTC (permalink / raw)
Cc: musl, nsz

Hi!

05-08-2013 02:44 Szabolcs Nagy <nsz@port70.net>:
> * Harald Becker <ralda@gmx.de> [2013-08-05 00:39:43 +0200]:
> > Why can't we have all these character conversions in a
> > state-driven machine which loads its information from an external
> > configuration file? This way we can have any kind of conversion
> > someone likes, just by adding the configuration file for the
> > required Unicode-to-X and X-to-Unicode conversions.
>
> external files provided by libc can work but they
> should be possible to embed into the binary

As far as I know, glibc creates small dynamically linked objects and
loads them when required. These are architecture-specific, so you
always need conversion files which correspond to your C library. My
intention is to write conversions as machine-independent byte code,
which may be copied between machines of different architectures. If
you need a charset conversion, just add the charset byte code to the
conversion directory, which may be configurable (directory name from
an environment variable, with a default fallback). There may even be
a search path for conversion files, so conversion files can be
installed in different locations.

> otherwise a static binary is not self-contained
> and you have to move parts of the libc around
> along with the binary and if they are loaded
> from fixed path then it does not work at all
> (permissions, conflicting versions etc)

Ok, I see the static linking topic, but this is no problem with byte
code conversion programs. It can easily be handled: just put all the
conversion byte code programs together into a single big array, with
a name and offset table ahead, then link it into your program.
This may be done in two steps:

1) Create a selection file for the musl build, and include the
   specified charsets in libc.a/.so.

2) Select the required charset files and create an .o file to link
   into your program.

iconv then shall:
- look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
- search the table of charsets linked with libc
- search the table of charsets linked with the program
- search for the charset on an external search path

... or do it in the opposite direction and use the first charset
conversion found. This lookup is usually very small, except for the
file system search, so it shall not produce much overhead / bloat.

[Addendum after thinking a bit more: The byte code conversion files
shall consist of a small static header, followed by the byte code
program. The header shall contain the charset name, the version of
the required virtual machine and the length of the byte code. So you
need only concatenate all such conversion files into a big array of
bytes and add a null header to mark the end of the table. Then you
only need the start of the array and you are able to search through
it for a specific charset. The iconv function in libc contains a
definition of an "unsigned char const *iconv_user_charsets = NULL;",
which is linked in when the user does not provide their own
definition. So iconv can search all linked-in charset definitions and
needs no code changes. Really simple configuration to select the
charsets to build in.]

> if the format changes then dynamic linking is
> problematic as well: you cannot update libc
> in a single atomic operation

The byte code shall be independent of dynamic linking. The conversion
files are only streams of bytes, which shall also be architecture
independent. So you only need to update the conversion files if the
virtual machine definition of iconv has been changed (which shall not
happen much). External files may be read into malloc-ed buffers or
mmap-ed, not linked in by the dynamic linker.

--
Harald

^ permalink raw reply	[flat|nested] 18+ messages in thread
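The addendum describes a packed array of (header, byte code) records terminated by a null header. A rough sketch of what that lookup could look like; the field names, sizes, and the demo table are illustrative assumptions, not a proposed ABI:

```c
#include <string.h>
#include <assert.h>

/* Hypothetical record layout: a fixed header followed immediately by
 * `len` bytes of byte code; an entry with an empty name ends the
 * list, as in the proposed "null header" terminator. */
struct conv_hdr {
	char name[16];		/* charset name, NUL-padded */
	unsigned char vm_ver;	/* required virtual-machine version */
	unsigned short len;	/* length of the byte code that follows */
};

/* Demo array: two entries with zero-length byte code, plus the
 * terminator, standing in for a linked-in iconv_user_charsets blob. */
static const struct conv_hdr demo[3] = {
	{ "ASCII",  1, 0 },
	{ "LATIN1", 1, 0 },
	{ "" }
};

/* Walk the packed records looking for a charset by name; return a
 * pointer to its byte code, or NULL if it is not present. */
const unsigned char *find_charset(const unsigned char *tab,
				  const char *name)
{
	for (;;) {
		const struct conv_hdr *h = (const struct conv_hdr *)tab;
		if (!h->name[0]) return 0;		/* null header: end */
		if (!strcmp(h->name, name))
			return tab + sizeof *h;		/* byte code follows */
		tab += sizeof *h + h->len;		/* skip to next record */
	}
}
```

Because the walk depends only on the header contents, the same search would work whether the blob was linked into the binary or mmapped from a file, which is the core of the portability claim above.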
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  1:24 ` Harald Becker
@ 2013-08-05  3:13 ` Szabolcs Nagy
  2013-08-05  7:03 ` Harald Becker
  0 siblings, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2013-08-05 3:13 UTC (permalink / raw)
To: musl

* Harald Becker <ralda@gmx.de> [2013-08-05 03:24:52 +0200]:
> iconv then shall:
> - look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
> - search the table of charsets linked with libc
> - search the table of charsets linked with the program
> - search for the charset on an external search path

sounds like a lot of extra management cost (for libc, the application
writer and the user as well)

it would be nice if the compiler could figure out at build time (eg
with lto) which tables are used, but i guess charsets are often only
known at runtime

> [Addendum after thinking a bit more: The byte code conversion files
> shall consist of a small static header, followed by the byte code
> program. The header shall contain the charset name, the version of
> the required virtual machine and the length of the byte code. So
> you need only concatenate all such conversion files into a big
> array of bytes and add a null header to mark the end of the table.
> Then you only need the start of the array and you are able to
> search through it for a specific charset. The iconv function in
> libc contains a definition of an "unsigned char const
> *iconv_user_charsets = NULL;", which is linked in when the user
> does not provide their own definition. So iconv can search all
> linked-in charset definitions and needs no code changes. Really
> simple configuration to select the charsets to build in.]

yes, that can work, but it's a musl-specific hack that the
application programmer needs to take care of

> > if the format changes then dynamic linking is problematic as
> > well: you cannot update libc in a single atomic operation
>
> The byte code shall be independent of dynamic linking. The
> conversion files are only streams of bytes, which shall also be
> architecture independent. So you only need to update the conversion
> files if the virtual machine definition of iconv has been changed
> (which shall not happen much). External files may be read into
> malloc-ed buffers or mmap-ed, not linked in by the dynamic linker.

that does not solve the format change problem: you cannot update
libc without a race (unless you first replace the .so with one which
supports the old format as well as the new one, but then libc has to
support all previous formats)

it's probably easy to design a fixed format to avoid this

it seems somewhat similar to the timezone problem, except zoneinfo
is maintained outside of libc so there is not much choice, but there
are the same issues: updating it should be done carefully, setuid
programs must be handled specially etc

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  3:13 ` Szabolcs Nagy
@ 2013-08-05  7:03 ` Harald Becker
  2013-08-05 12:54 ` Rich Felker
  0 siblings, 1 reply; 18+ messages in thread
From: Harald Becker @ 2013-08-05 7:03 UTC (permalink / raw)
Cc: musl, nsz

Hi!

05-08-2013 05:13 Szabolcs Nagy <nsz@port70.net>:
> * Harald Becker <ralda@gmx.de> [2013-08-05 03:24:52 +0200]:
> > iconv then shall:
> > - look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
> > - search the table of charsets linked with libc
> > - search the table of charsets linked with the program
> > - search for the charset on an external search path
>
> sounds like a lot of extra management cost
> (for libc, application writer and user as well)

This is not so much work. You already need to search for the
character set table to use; that is, you need to search at least a
table of string values to find the pointer to the conversion table.
Searching a table in my statement above means just walking a pointer
chain doing string compares to find a matching character set; not
much difference from the code that is required anyway. Doing this
twice to check a possible user chain is just one more helper function
call.

The only code that gets a bit bigger is the file system search. This
depends on whether we try only a single location or walk through a
search path list. But this is the cost of the flexibility to
dynamically load character set conversions (which I would really
prefer for seldom-used charsets).

... and for the application writer it is only more work if he wants
to add some charset tables into his program which are not in the
static libc.

The problem is, all tables in libc need to be linked into your
program if you include iconv. So each added charset conversion
increases the size of your program ... and I definitely won't include
Japanese, Chinese or Korean charsets in my program. Not that I ignore
those people's needs; I just won't need them, so I don't like to add
those conversions to programs sitting on my disk.
> it would be nice if the compiler could figure out
> at build time (eg with lto) which tables are used
> but i guess charsets are often only known at runtime

How do you want to do this? And how shall the compiler know which
charsets the user may use during operation? So the only way to select
the charset tables to include in your program is by guessing ahead
which tables might be used. That is part of the configuration of the
musl build or the application program build.

> > [Addendum after thinking a bit more: The byte code conversion
> > files shall consist of a small static header, followed by the
> > byte code program. The header shall contain the charset name, the
> > version of the required virtual machine and the length of the
> > byte code. So you need only concatenate all such conversion files
> > into a big array of bytes and add a null header to mark the end
> > of the table. Then you only need the start of the array and you
> > are able to search through it for a specific charset. The iconv
> > function in libc contains a definition of an "unsigned char const
> > *iconv_user_charsets = NULL;", which is linked in when the user
> > does not provide their own definition. So iconv can search all
> > linked-in charset definitions and needs no code changes. Really
> > simple configuration to select the charsets to build in.]
>
> yes, that can work, but it's a musl-specific hack that the
> application programmer needs to take care of

Only if the application programmer wants to add a charset to the
statically built program which is not in libc does some extra work
have to be done, giving some more flexibility. If you don't care, you
get musl's built-in list of charsets.

> > > if the format changes then dynamic linking is problematic as
> > > well: you cannot update libc in a single atomic operation
> >
> > The byte code shall be independent of dynamic linking. The
> > conversion files are only streams of bytes, which shall also be
> > architecture independent. So you only need to update the
> > conversion files if the virtual machine definition of iconv has
> > been changed (which shall not happen much). External files may be
> > read into malloc-ed buffers or mmap-ed, not linked in by the
> > dynamic linker.
>
> that does not solve the format change problem: you cannot update
> libc without a race (unless you first replace the .so with one
> which supports the old format as well as the new one, but then
> libc has to support all previous formats)

If the definition of the iconv virtual state machine is modified, you
need to take extra care on update (delete old charset files, install
the new lib, install new charset files, restart the system) ... but
this is only required on a major update. As soon as the virtual
machine definition has stabilized, you do not need to change the
charset definition files; just update your lib, then update any new
charset files. After an initial phase of testing it shall happen
relatively seldom that the virtual machine definition needs to be
changed in an incompatible manner, and simply extending the virtual
machine does not invalidate the old charset files.

> it's probably easy to design a fixed format to
> avoid this

A fixed format? For what? Do you know the differences between
charsets, especially multi-byte charsets?

> it seems somewhat similar to the timezone problem, except zoneinfo
> is maintained outside of libc so there is not much choice, but
> there are the same issues: updating it should be done carefully,
> setuid programs must be handled specially etc

Again: as soon as the virtual machine definition has reached a stable
state, it shall not happen often that a change invalidates a charset
definition file. That is, at least old files will continue to work
with newer lib versions, so there is no problem on update; just
update your lib, then update your charset files. The only problem
will be if a still-running application uses a new charset file with
an old version of the lib.
This will be detected and leads to iconv returning a failure code. So
you need to restart your application ... which is always a good
decision anyway, as you updated your lib.

--
Harald

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  7:03 ` Harald Becker
@ 2013-08-05 12:54 ` Rich Felker
  0 siblings, 0 replies; 18+ messages in thread
From: Rich Felker @ 2013-08-05 12:54 UTC (permalink / raw)
To: musl; +Cc: nsz

On Mon, Aug 05, 2013 at 09:03:43AM +0200, Harald Becker wrote:
> The only code that gets a bit bigger is the file system search.
> This depends on whether we try only a single location or walk
> through a search path list. But this is the cost of the flexibility
> to dynamically load character set conversions (which I would really
> prefer for seldom-used charsets).

The only "seldom used charsets" are either extremely small (8-bit
codepages) or simply encoding variants of an existing CJK DBCS (in
which case supporting them is just a matter of code, not large data
tables).

> ... and for the application writer it is only more work if he wants
> to add some charset tables into his program which are not in the
> static libc.

This is only helpful if the application writer is designing around
musl. This is a practice we explicitly discourage.

> The problem is, all tables in libc need to be linked into your
> program if you include iconv. So each added charset conversion
> increases the size of your program ... and I definitely won't
> include Japanese, Chinese or Korean charsets in my program. Not
> that I ignore those people's needs; I just won't need them, so I
> don't like to add those conversions to programs sitting on my disk.

How many programs do you intend to use iconv in that _don't_ need to
support arbitrary encodings, including ones you might not be using
yourself? Even if you don't read Korean, if a Korean user sends you
an email containing non-ASCII punctuation, Greek letters like
epsilon, etc., there's a fair chance their MUA will choose to encode
it with a legacy Korean encoding rather than UTF-8, and then you need
the conversion.
It would be nice if everybody encoded everything in UTF-8 so the
recipient were not responsible for supporting a wide range of legacy
encodings, but that's not the reality today.

> If the definition of the iconv virtual state machine is modified,
> you need to take extra care on update (delete old charset files,
> install the new lib, install new charset files, restart the system)
> ... but this is only required on a major update. As soon as the

Even if there were really good reasons for the design you're
proposing, such a violation of the stability and atomic upgrade
policy would require a strong overriding justification. We don't have
that here.

Rich

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 22:39 ` Harald Becker
  2013-08-05  0:44 ` Szabolcs Nagy
@ 2013-08-05  0:49 ` Rich Felker
  2013-08-05  1:53 ` Harald Becker
  1 sibling, 1 reply; 18+ messages in thread
From: Rich Felker @ 2013-08-05 0:49 UTC (permalink / raw)
To: musl

On Mon, Aug 05, 2013 at 12:39:43AM +0200, Harald Becker wrote:
> Hi Rich!
>
> > Worst-case, adding Korean and Traditional Chinese tables will
> > roughly double the size of iconv.o to around 150k. This will
> > noticeably enlarge libc.so, but will make no difference to
> > static-linked programs except those using iconv. I'm hoping we
> > can make these additions less expensive, but I don't see a good
> > way yet.
>
> Oh nooo, do you really want to add this statically to the iconv
> version?

Do I want to add that size? No, of course not, and that's why I'm
hoping (but not optimistic) that there may be a way to elide a good
part of the table based on patterns in the Hangul syllables, or the
possibility that the giant extensions are unimportant.

Do I want to give users who have large volumes of legacy text in
their languages stored in these encodings the same respect and
dignity as users of other legacy encodings we already support? Yes.

> Why can't we have all these character conversions in a state-driven
> machine which loads its information from an external configuration
> file? This way we can have any kind of conversion someone likes,
> just by adding the configuration file for the required Unicode-to-X
> and X-to-Unicode conversions.
This issue was discussed a long time ago, and the consensus among
users of static linking was that static linking is most valuable when
it makes the binary completely "portable" to arbitrary Linux systems
for the same cpu arch, without any dependency on having files in
particular locations on the system aside from the minimum required by
POSIX (things like /dev/null), the standard Linux /proc mountpoint,
and universal config files like /etc/resolv.conf (even that is not
necessary, BTW, if you have a DNS on localhost). Having iconv not
work without external character tables is essentially a form of
dynamic linking, and carries with it issues like where the files are
to be found (you could override that with an environment variable,
but that can't be permitted for setuid binaries), what happens if the
format needs to change and the format on the target machine is not
compatible with the libc version your binary was built with, etc.
This is also the main reason musl does not support something like
NSS.

Another side benefit of the current implementation is that it's fully
self-contained and independent of any system facilities. It's pure C
and can be taken out of musl and dropped into any program on any C
implementation, including freestanding (non-hosted) implementations.
If it depended on the filesystem, adapting it for such usage would be
a lot more work.

> State-driven FSM interpreters are really small and fast and may
> read their complete configuration from a file ... an
> architecture-independent file, so we could have the same character
> conversion files for all architectures.

An FSM implementation would be several times larger than the
implementations in iconv.c.

It's possible that we could, at some time in the future, support
loading of user-defined character conversion files as an added
feature, but this should only be for really special-purpose things
like custom encodings used for games or obsolete systems (old Mac,
console games, IBM mainframes, etc.).
In terms of the criteria for what to include in musl itself, my idea
is that if you have a mail client or web browser based on iconv for
its character set handling, you should be able to read the bulk of
content in any language.

Rich

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  0:49 ` Rich Felker
@ 2013-08-05  1:53 ` Harald Becker
  2013-08-05  3:39 ` Rich Felker
  0 siblings, 1 reply; 18+ messages in thread
From: Harald Becker @ 2013-08-05 1:53 UTC (permalink / raw)
Cc: musl, dalias

Hi Rich!

04-08-2013 20:49 Rich Felker <dalias@aerifal.cx>:
> Do I want to add that size? No, of course not, and that's why I'm
> hoping (but not optimistic) that there may be a way to elide a good
> part of the table based on patterns in the Hangul syllables, or the
> possibility that the giant extensions are unimportant.

I think there is a way for easy configuration. See my other mails;
they clarify what my intention is.

> Do I want to give users who have large volumes of legacy text in
> their languages stored in these encodings the same respect and
> dignity as users of other legacy encodings we already support? Yes.

Of course. I won't dictate to others which conversions they want to
use. I only hate having plenty of conversion tables on my system when
I really know I will never use such kinds of conversions ... but in
case I really need one, it can be added dynamically to the running
system.

> > Why can't we have all these character conversions in a
> > state-driven machine which loads its information from an external
> > configuration file? This way we can have any kind of conversion
> > someone likes, just by adding the configuration file for the
> > required Unicode-to-X and X-to-Unicode conversions.
> This issue was discussed a long time ago, and the consensus among
> users of static linking was that static linking is most valuable
> when it makes the binary completely "portable" to arbitrary Linux
> systems for the same cpu arch, without any dependency on having
> files in particular locations on the system aside from the minimum
> required by POSIX (things like /dev/null), the standard Linux /proc
> mountpoint, and universal config files like /etc/resolv.conf (even
> that is not necessary, BTW, if you have a DNS on localhost). Having
> iconv not work without external character tables is essentially a
> form of dynamic linking, and carries with it issues like where the
> files are to be found (you could override that with an environment
> variable, but that can't be permitted for setuid binaries), what
> happens if the format needs to change and the format on the target
> machine is not compatible with the libc version your binary was
> built with, etc. This is also the main reason musl does not support
> something like NSS.

I see the topic of self-contained linking, and you are right that it
is required, but it is fully possible to have the best of both worlds
without much overhead. Writing iconv as a virtual machine interpreter
allows statically linking in the conversion byte code programs. Those
which are not linked in can be searched for in the filesystem, and a
simple configuration option may disable the file system search
completely, for really small embedded operation. But beside this, all
conversions are the same and may be freely copied between
architectures, or linked statically into a user program (just put the
byte streams of the selected charsets into a simple C array of
bytes).

> Another side benefit of the current implementation is that it's
> fully self-contained and independent of any system facilities. It's
> pure C and can be taken out of musl and dropped into any program on
> any C implementation, including freestanding (non-hosted)
> implementations. If it depended on the filesystem, adapting it for
> such usage would be a lot more work.

The virtual machine shall be written in C; I've done this type of
programming many times. So the resulting code will compile with any C
compiler, and the byte code programs are just arrays of bytes,
independent of machine byte order. So you will not have any further
dependencies.

> An FSM implementation would be several times larger than the
> implementations in iconv.c.

A bit larger, yes ... but not so much if the virtual machine is
designed carefully, and it will not increase in size when more
charsets are added (only the size of the added byte code programs).

> It's possible that we could, at some time in the future, support
> loading of user-defined character conversion files as an added
> feature, but this should only be for really special-purpose things
> like custom encodings used for games or obsolete systems (old Mac,
> console games, IBM mainframes, etc.).

We can have it all, with not much overhead. And it is not only for
such special cases. I don't like to install musl on my systems with
Japanese, Chinese or Korean conversions, but in case I really need
them, I'm able to throw them in without much work ... and we can add
any character conversion on the fly, without rebuilding the library.

> In terms of the criteria for what to include in musl itself, my
> idea is that if you have a mail client or web browser based on
> iconv for its character set handling, you should be able to read
> the bulk of content in any language.

That holds if you are building a mail client or web browser; but what
if you want to include the possibility of charset conversion yet stay
small, including only the system-relevant conversions without being
limited to those? Any other conversion can then be added on the fly.

--
Harald

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 1:53 ` Harald Becker @ 2013-08-05 3:39 ` Rich Felker 2013-08-05 7:53 ` Harald Becker 0 siblings, 1 reply; 18+ messages in thread From: Rich Felker @ 2013-08-05 3:39 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 03:53:12AM +0200, Harald Becker wrote: > Hi Rich ! > > 04-08-2013 20:49 Rich Felker <dalias@aerifal.cx>: > > > Do I want to add that size? No, of course not, and that's why > > I'm hoping (but not optimistic) that there may be a way to > > elide a good part of the table based on patterns in the Hangul > > syllables or the possibility that the giant extensions are > > unimportant. > > I think there is a way for easy configuration. See other mails, > they clarify what my intention is. I saw, and you're free to write such an iconv implementation if you like, but it's not right for musl. Inventing elaborate mechanisms to solve simple problems is the glibc way of doing things, not the musl way. iconv is not something that needs to be extensible. There is a finite set of legacy encodings that's relevant to the world, and their relevance is going to go down and down with time, not up. > > Do I want to give users who have large volumes of legacy text > > in their languages stored in these encodings the same respect > > and dignity as users of other legacy encodings we already > > support? Yes. > > Of course. I won't dictate others which conversions they want to > use. I only hat to have plenty of conversion tables on my system > when I really know I never use such kind of conversions. And your table for just Chinese is as large as all our tables combined... I agree you can make iconv smaller than musl's in the case where _no_ legacy DBCS are installed. But if you have just one, you'll be just as large or larger than musl with them all. Just compare the size of musl's tables to glibc's converters. 
I've worked hard to make them as small as reasonably possible without doing hideous hacks like decompression into an in-memory buffer, which would actually increase bloat. > ... but > in case I really need, it can be added dynamically to the running > system. If you have root or want to setup nonstandard environment variables. > > This issue was discussed a long time ago and the consensus > > among users of static linking was that static linking is most > > valuable when it makes the binary completely "portable" to > > arbitrary Linux systems for the same cpu arch, without any > > dependency on having files in particular locations on the > > system aside from the minimum required by POSIX (things > > like /dev/null), the standard Linux /proc mountpoint, and > > universal config files like /etc/resolv.conf (even that is not > > necessary, BTW, if you have a DNS on localhost). Having iconv > > not work without external character tables is essentially a > > form of dynamic linking, and carries with it issues like where > > the files are to be found (you can override that with an > > environment variable, but that can't be permitted for setuid > > binaries), what happens if the format needs to change and the > > format on the target machine is not compatible with the libc > > version your binary was built with, etc. This is also the main > > reason musl does not support something like nss. > > I see the topic of self contained linking, and you are right that > is is required, but it is fully possible to have best of both > worlds without much overhead. Writing iconv as a virtual machine It's not the best of both worlds. It's essentially the same as dynamic linking. > interpreter allows to statical link in the conversion byte code > programs. At several times the size of the current code/tables, and after the user searches through the documentation to figure out how to do it. 
> > Another side benefit of the current implementation is that it's > > fully self-contained and independent of any system facilities. > > It's pure C and can be taken out of musl and dropped in to any > > program on any C implementation, including freestanding > > (non-hosted) implementations. If it depended on the filesystem, > > adapting it for such usage would be a lot more work. > > The virtual machine shall be written in C, I've done such type of > programming many times. So resulting code will compile with any C > compiler, and byte code programs are just array of bytes, > independent of machine byte order. So you will have any further > dependencies. It's not just a matter of dropping in. You'd have path searches to modify or disable, build options to get the static tables turned on, and all of this stuff would have to be integrated with the build system for what you're dropping it into. Complexity is never the solution. Honestly, I would take a 1mb increase in binary size over this kind of complexity any day. Thankfully, we don't have to make such a tradeoff. > > A fsm implementation would be several times larger than the > > implementations in iconv.c. > > A bit larger, yes ... but not so much, if virtual machine gets > designed carefully, and it will not increase in size, when there > are more charsets get added (only size of byte code program > added). Charsets are not added. The time of charsets is over. It should have been over in 1992, when Pike and Thompson made them obsolete, but it's really over now. > > It's possible that we could, at some time in the future, > > support loading of user-defined character conversion files as > > an added feature, but this should only be for really > > special-purpose things like custom encodings used for games or > > obsolete systems (old Mac, console games, IBM mainframes, etc.). > > We can have it all, with not much overhead. And it is not only > for such special cases. 
I don't like to install musl on my > systems with Japanese, Chinese or Korean conversions, but in case > I really need, I'm able to throw them in, without much work. > > .... and we can add every character conversion on the fly, without > rebuild of the library. Maybe we should also include a bytecode interpreter for doing hostname lookups, since you might want to do something other than DNS or a hosts file. And a bytecode interpreter for user database lookups in place of passwd files. And a bytecode interpreter for adding new crypt() algorithms. And... > > In terms of the criteria for what to include in musl itself, my > > idea is that if you have a mail client or web browser based on > > iconv for its character set handling, you should be able to > > read the bulk of content in any language. > > If you are building a mail client or web browser, but what if you > want to include the possibility of charset conversion but stay at > small size, just including conversions for only system relevant > conversions, but not limiting to those. Any other conversion can > then be added on the fly. Then dynamic link it. If you want an extensible binary, you use dynamic linking. The main reason for static linking is when you want a binary whose behavior does not change with the runtime environment -- for example, for security purposes, for carrying around to other machines that don't have the same runtime environment, etc. Rich ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 3:39 ` Rich Felker @ 2013-08-05 7:53 ` Harald Becker 2013-08-05 8:24 ` Justin Cormack 2013-08-05 14:35 ` Rich Felker 0 siblings, 2 replies; 18+ messages in thread From: Harald Becker @ 2013-08-05 7:53 UTC (permalink / raw) Cc: musl, dalias Hi Rich ! > iconv is not something that needs to be extensible. There is a > finite set of legacy encodings that's relevant to the world, > and their relevance is going to go down and down with time, not > up. Oh! So you get to decide whether Japanese, Chinese, Korean, etc. are relevant for the programs sitting on my machines? How can you decide this? Why be so ignorant as to write a standards-conforming library and then pick out a list of charsets of your own choice to be possible with iconv, neglecting the wishes and needs of other musl users. ... or in other words, if you really are this ignorant and insist on compiling those charsets into musl, then musl is no longer for me :( ... I don't need to bring any part of mine into musl, but I don't consider a lib usable for my needs if it includes several charset files in a static build and refuses to load seldom-used charset definitions from external files in any way. > > > > Do I want to give users who have large volumes of legacy > > > text in their languages stored in these encodings the same > > > respect and dignity as users of other legacy encodings we > > > already support? Yes. > > > > Of course. I won't dictate others which conversions they want > > to use. I only hat to have plenty of conversion tables on my > > system when I really know I never use such kind of > > conversions. > > And your table for just Chinese is as large as all our tables > combined... How can you tell? I don't think so. Such conversion code may be very compact.
Size is mainly required for translation tables, that is, when the code points of the charset do not match Unicode character order; but you always need the space for those translations. The rest won't be much. > I agree you can make iconv smaller than musl's in the case > where _no_ legacy DBCS are installed. But if you have just one, > you'll be just as large or larger than musl with them all. ... musl with them all? I don't consider those smaller than an optimized byte code interpreter ... not when you are going to compile DBCS charsets into musl. At least not if you do all the required translations. > compare the size of musl's tables to glibc's converters. I've > worked hard to make them as small as reasonably possible > without doing hideous hacks like decompression into an > in-memory buffer, which would actually increase bloat. Are you now building a lib only for startup and embedded systems, or are you trying to write a general-purpose library? Including all those definitions in a static build is definitely not something I will ever like. This may be done for some special situations and selected charsets, but not for a general-purpose library that claims to aim for wide usage. > If you have root or want to setup nonstandard environment > variables. What about a charset search path including something like "~/.local/share/charset"? This would allow installing charset files in the user's directory. > > interpreter allows to statical link in the conversion byte > > code programs. > > At several times the size of the current code/tables, and after > the user searches through the documentation to figure out how > to do it. You really want to include all those code tables statically in musl? I wouldn't include much more than some standard sets. Why don't you want to load the charset definitions as they are required?
On one hand you say "use dietlibc" if you need small static programs, and on the other hand you want to include many charset definitions in a static build to avoid dynamic loading of tables. So what's the purpose of musl? I don't think you are being consistent here. > It's not just a matter of dropping in. You'd have path searches > to modify or disable, build options to get the static tables > turned on, and all of this stuff would have to be integrated > with the build system for what you're dropping it into. I don't see the claimed complexity. In fact I won't use a lib that includes several charset definitions in a static build. I would really like to have a directory with definition files for those charsets, and I don't see the complexity in this that you proclaim. Inclusion in a static build is no more than selecting the charsets you want included statically. This selection is always required, or else you include all files, which I definitely reject. > Complexity is never the solution. Honestly, I would take a 1mb > increase in binary size over this kind of complexity any day. > Thankfully, we don't have to make such a tradeoff. The only complexity we have here is the complexity of charset translation. The rest is relatively simple. > Charsets are not added. The time of charsets is over. It should > have been over in 1992, when Pike and Thompson made them > obsolete, but it's really over now. So why are you adding Japanese, Chinese and Korean charsets to the iconv in musl? Why not just use UTF-8? Whenever you use iconv you want the flexibility to do all required charset conversions, which means you either need to statically link in many charset definitions or dynamically load what is required. > Then dynamic link it. If you want an extensible binary, you use > dynamic linking. Dynamic linking of a mail client, OK, but where do the charset definition files go? Are they all packed into your libc.so?
That would be a very big file. Why do I need to have Asian language definitions on my disk when I do not want them? It is your decision, but please state clearly for what purpose you are building musl. Here it looks like you are mixing things up and stepping in a direction I will never like. -- Rich ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 7:53 ` Harald Becker @ 2013-08-05 8:24 ` Justin Cormack 2013-08-05 14:43 ` Rich Felker 2013-08-05 14:35 ` Rich Felker 1 sibling, 1 reply; 18+ messages in thread From: Justin Cormack @ 2013-08-05 8:24 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 6597 bytes --] On 5 Aug 2013 08:53, "Harald Becker" <ralda@gmx.de> wrote: > > Hi Rich ! > > > iconv is not something that needs to be extensible. There is a > > finite set of legacy encodings that's relevant to the world, > > and their relevance is going to go down and down with time, not > > up. > > Oh! So you consider Japanese, Chinese, Korean, etc. languages > relevant for programs sitting on my machines? How can you decide > this? Why being so ignorant and trying to write an standard > conform library and then pick out a list of char sets of your > choice which may be possible on iconv, neglecting wishes and > need of any musl user. > > ... or in other words, if you really be this ignorant and > insist on including those charsets fixed in musl, musl is never > more for me :( ... I don't need to bring in any part of mine into > musl, but I don't consider a lib usable for my needs, which > include several char set files in statical build and neglects to > load seldom used charset definitions from extern in any way. They are not going to be "fixed"; just don't build them. It is not hard with Musl. Just add this into your build script. One of the nice features of Musl is that it appeals to a broader audience than just "embedded", so it is always going to have stuff you can cut out if you want absolute minimalism, but this means it will get wider usage. Adding external files has many disadvantages for other people. If you don't want these conversions, external files do not help you. Making software for more than one person involves compromises, so please calm down a bit.
Use your own embedded build with the parts you don't need omitted. Justin > > > > > > Do I want to give users who have large volumes of legacy > > > > text in their languages stored in these encodings the same > > > > respect and dignity as users of other legacy encodings we > > > > already support? Yes. > > > > > > Of course. I won't dictate others which conversions they want > > > to use. I only hat to have plenty of conversion tables on my > > > system when I really know I never use such kind of > > > conversions. > > > > And your table for just Chinese is as large as all our tables > > combined... > > How can you tell this. I don't think so. Such conversion codes > may be very compact. Size is mainly required for translation > tables, that is when code points of the char sets does not match > Unicode character order, but you always need the space for those > translations. The rest won't be much. > > > I agree you can make iconv smaller than musl's in the case > > where _no_ legacy DBCS are installed. But if you have just one, > > you'll be just as large or larger than musl with them all. > > ... musl with them all? I don't consider them smaller than an > optimized byte code interpreter ... not when you are going to > include DBCS char sets fixed into musl. At least if you do all > the required translations. > > > compare the size of musl's tables to glibc's converters. I've > > worked hard to make them as small as reasonably possible > > without doing hideous hacks like decompression into an > > in-memory buffer, which would actually increase bloat. > > Are you now going to build a lib for startup purpose and embedded > systems only or are you trying to write a general purpose > library? Including all those definitions in a statical build is > definitely not the way I will ever like. This may be done for > some special situations and selected char sets, but not for a > general purpose library, claiming to get a wide usage. 
> > > If you have root or want to setup nonstandard environment > > variables. > > What about a charset searchpath including something like > "~/.local/share/charset". This would allow to install charset > files in the users directory. > > > > interpreter allows to statical link in the conversion byte > > > code programs. > > > > At several times the size of the current code/tables, and after > > the user searches through the documentation to figure out how > > to do it. > > You definitely consider to include all those code tables > statically into musl? I won't include much more than some > standard sets. Why don't you want to load the charset definitions > as they are required? > > On one hand you say "use dietlibc" if you need small statical > programs and on the other hand you want to include many charset > definitions into a statical build to avoid dynamic loading of > tables, required only on embedded systems. > > So what's the purpose of musl? I don't think you stay right here. > > > It's not just a matter of dropping in. You'd have path searches > > to modify or disable, build options to get the static tables > > turned on, and all of this stuff would have to be integrated > > with the build system for what you're dropping it into. > > I don't see the required complexity. In fact I won't have a lib > that includes several charset definitions in a statical build. I > really like to have a directory with definition files for those > char sets and don't see the complexity for this you proclamate. > > Inclusion in statical build is not more than selection of the > charsets you want o be included statically. This selection is > always required or you include all files , which I definitly > neglect. > > > Complexity is never the solution. Honestly, I would take a 1mb > > increase in binary size over this kind of complexity any day. > > Thankfully, we don't have to make such a tradeoff. 
> > The only complexity which we has here is the complexity of > charset translation. The rest is relatively simple. > > > Charsets are not added. The time of charsets is over. It should > > have been over in 1992, when Pike and Thompson made them > > obsolete, but it's really over now. > > So why are you adding Japanese, Chinese and Korean charsets to an > iconv conversion in musl? Why not just using UTF-8? Whenever you > use iconv you want the flexibility to do all required charset > conversions. Which means you need to statically link in many > charset definitions or you need to dynamically load what is > required. > > > Then dynamic link it. If you want an extensible binary, you use > > dynamic linking. > > Dynamic linking of mail client, ok and where go the charset > definition files? Are they all packed into your libc.so? That is > a very big file? Why do I need to have Asian language definition > on my disk, when I do not want? > > It is your decision, but please state clear what purpose you are > building musl. Here it looks you are mixing things and steping in > a direction I will never like. > > -- > Rich [-- Attachment #2: Type: text/html, Size: 8041 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 8:24 ` Justin Cormack @ 2013-08-05 14:43 ` Rich Felker 0 siblings, 0 replies; 18+ messages in thread From: Rich Felker @ 2013-08-05 14:43 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 09:24:37AM +0100, Justin Cormack wrote: > They are not going to be "fixed" just don't build them. It is not hard with > Musl. Just add this into your build script. Indeed. My intent is for it to be fully-functional-as-shipped. If somebody needs to cripple certain interfaces to meet extreme size requirements, that's an ok local modification, and it might even be acceptable as a configure option if enough people legitimately request it. > One of the nice features of Musl is that it appeals to a broader audience > than just "embedded" so it is always going to have stuff you can cut out if > you want absolute minimalism but this means it will get wider usage. Cutting out math/*, complex/*, and most of crypt/* would save at least as much space as iconv, and there are plenty of places these aren't needed either. It's not for me to decide which options you can omit. Thankfully, due to musl's correct handling of static linking, you usually don't have to think about it either. You just static link and get only what you need. > Adding external files has many disadvantages to other people. If you don't > want these conversions external files do not help you. External files also do not make things work "by default". They only work if musl has been installed system-wide according to our directions (which not everybody will follow) or if the user has done the research to figure out how to work around it not being installed system-wide. > Making software for more than one person involves compromises so please > calm down a bit. Use your own embedded build with the parts you don't need > omitted. Exactly. Where musl excels here is by not _forcing_ you to use iconv.
I take great care not to force linking of components you might not want to see in your output binary size, and for TLS, which unfortunately was misdesigned in such a way that the linker can't see if TLS is used or not for the purpose of deciding whether to link the TLS init code, I went to great lengths both to minimize the size of __init_tls.o and to make it easy, as a local customization, to omit this module. But as an analogy, I would not have even considered asking musl users who need TLS to add special CFLAGS, libraries, etc. when building programs. That's an unreasonable burden and it's broken because it does not "work by default". Rich ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 7:53 ` Harald Becker 2013-08-05 8:24 ` Justin Cormack @ 2013-08-05 14:35 ` Rich Felker 1 sibling, 0 replies; 18+ messages in thread From: Rich Felker @ 2013-08-05 14:35 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 09:53:32AM +0200, Harald Becker wrote: > Hi Rich ! > > > iconv is not something that needs to be extensible. There is a > > finite set of legacy encodings that's relevant to the world, > > and their relevance is going to go down and down with time, not > > up. > > Oh! So you consider Japanese, Chinese, Korean, etc. languages > relevant for programs sitting on my machines? How can you decide I don't decide what's relevant for you. Rather, I don't have the authority to declare it irrelevant-by-default. This is true even for things like crypt algorithms (does anybody really want to use md5??) but especially for anything that would preclude somebody from being able to receive data in their native language. Simple multilingual support via UTF-8 with conversion from legacy data has been near top priority, if not top, since the conception of musl. If history has shown us anything, it's that universal support for all languages must be default and turning off some support to save space (which is rarely if ever actually needed) needs to be a conscious decision. I'm no Apple fan by any means, but just look at the situation on iOS: you can turn on a new iPhone or iPad and read data in any language (including having the relevant fonts!) and even add a keyboard and type in almost any language, without having to buy a special localized version or install add-ons. This is very different from the situation on Android right now. musl's intended applicability is broad. 
From industrial control to settop boxes, in-car entertainment, initramfs images for desktop machines, phones, tablets, plug computers that run your private home or office webmail server, full desktops, VE LAMP stacks, hosts for VEs, etc. Some of these usages have a real need for human-language text; others don't. But if we have the power to make it such that, if someone uses musl to implement a plug computer for webmail, it naturally supports all languages unless the maker of the device goes and actively rips that support out, then we have a responsibility to do so. Or, said differently, it's OUR FAULT for making broken-by-default software if language support is missing unless you go to the effort of learning musl-specific ways to enable it. > this? Why being so ignorant and trying to write an standard > conform library and then pick out a list of char sets of your > choice which may be possible on iconv, neglecting wishes and > need of any musl user. If I were to just accept your demands, it would essentially mean: (1) discarding the opinions of everybody else who discussed this issue in the past and decided that static linking should mean real static binaries that work the same without needing extra files in the filesystem; (2) discarding the informed decisions I made based on said discussions. > .... or in other words, if you really be this ignorant and > insist on including those charsets fixed in musl, musl is never > more for me :( ... I don't need to bring in any part of mine into > musl, but I don't consider a lib usable for my needs, which > include several char set files in statical build and neglects to > load seldom used charset definitions from extern in any way. Name the extra "seldom used charset definitions" you're interested in. They're probably already supported. We are not discussing adding some new giant subsystem to musl.
We are discussing adding the last two missing major legacy charsets to an existing framework that's existed for a long time. > > > > Do I want to give users who have large volumes of legacy > > > > text in their languages stored in these encodings the same > > > > respect and dignity as users of other legacy encodings we > > > > already support? Yes. > > > > > > Of course. I won't dictate others which conversions they want > > > to use. I only hat to have plenty of conversion tables on my > > > system when I really know I never use such kind of > > > conversions. > > > > And your table for just Chinese is as large as all our tables > > combined... > > How can you tell this. I don't think so. You're welcome to implement it and see. Thanks to the way static linking works, if you add -lyouriconv when static linking, the iconv in musl will be completely omitted from the binary and yours will be used instead. Of course the iconv in musl will be completely omitted anyway except in the small number of programs that actually use iconv. This is not glibc where stdio and locale depend on iconv. iconv is purely iconv. > Such conversion codes > may be very compact. Size is mainly required for translation > tables, that is when code points of the char sets does not match > Unicode character order, but you always need the space for those > translations. The rest won't be much. That's all the size. The VAST majority of the table size is for 4 major character encoding families, those based on: - JIS 0208 - GB 18030 - KS X 1001 - Big5 As for legacy 8-bit encodings, musl's approach to them is also more efficient than you could easily be with a state machine. The fact that the number of codepoints that ever appear in an 8-bit encoding is less than 1024 is used to store the mappings as 10-bit-per-entry packed arrays of indices into the legacy_chars table. This reduces the marginal cost of individual 8bit encodings by 25% (versus 16-bit entries). 
The ASCII range and any span upward into the high range that maps directly to Unicode codepoints is also elided from the table (which reduces ISO-8859-* by another 62.5%). In short, what we have is about the smallest possible representation you can get without applying LZMA or something (and thereby needing all the code to decompress and dirty pages to store the decompressed version). It's hard to beat. By the way, if you really want to save the space they take, you could just delete this email thread from your mail folder. It's larger than musl's iconv already. :-) > > I agree you can make iconv smaller than musl's in the case > > where _no_ legacy DBCS are installed. But if you have just one, > > you'll be just as large or larger than musl with them all. > > .... musl with them all? I don't consider them smaller than an > optimized byte code interpreter ... not when you are going to > include DBCS char sets fixed into musl. At least if you do all > the required translations. I may have been exaggerating a little bit, but I doubt you can get your bytecode GB18030 support smaller than about 110k once you count the bytecode and the interpreter binary. I'm even more doubtful that you can get it smaller than the current 71k in musl. > > compare the size of musl's tables to glibc's converters. I've > > worked hard to make them as small as reasonably possible > > without doing hideous hacks like decompression into an > > in-memory buffer, which would actually increase bloat. > > Are you now going to build a lib for startup purpose and embedded > systems only or are you trying to write a general purpose > library? General-purpose. Have you not read the website? Originally in the 1990s, Linux-based systems used a fork of the GNU C library (glibc) version 1, which existed in various versions (libc4, libc5). Later, distributions adopted the more mature version 2 of glibc, and denoted it libc6. 
Since then, other specialized C library implementations such as uClibc and dietlibc have emerged as well. musl is a new general-purpose implementation of the C library. It is lightweight, fast, simple, free, and aims to be correct in the sense of standards-conformance and safety. If you're using it for startup purposes or embedded systems that don't communicate with humans in human language, you won't be running applications that call iconv() and thus it's irrelevant. > On one hand you say "use dietlibc" if you need small statical > programs and on the other hand you want to include many charset > definitions into a statical build to avoid dynamic loading of > tables, required only on embedded systems. Where did I say "use dietlibc"? If I did (I don't really remember) it was not a serious recommendation but a sarcastic remark to make a point that musl is not about being "smallest-at-all-costs" (and thereby broken) like dietlibc is. > > have been over in 1992, when Pike and Thompson made them > > obsolete, but it's really over now. > > So why are you adding Japanese, Chinese and Korean charsets to an > iconv conversion in musl? Why not just using UTF-8? Whenever you > use iconv you want the flexibility to do all required charset > conversions. Which means you need to statically link in many > charset definitions or you need to dynamically load what is > required. The time of creating charsets is over. That does not magically make the data created in those charsets in the past go away or convert itself to UTF-8. It doesn't even magically stop people from making new data in those charsets. All it means is that governments, vendors, etc. have stopped the madness of making new charsets. > > Then dynamic link it. If you want an extensible binary, you use > > dynamic linking. > > Dynamic linking of mail client, ok and where go the charset > definition files? Are they all packed into your libc.so? That is > a very big file? 
> Why do I need to have Asian language definitions
> on my disk, when I do not want them?

Because any other solution would be larger, would defeat the purpose of static linking, and would contribute to the problem of poor multilingual support. Why are you upset about these tables and not other tables like crypto sboxes, wcwidth, character classes, bits of 2/pi and pi/2, etc.? By the way, math/*.o are also fairly large, on the same order of magnitude as iconv; would you also suggest we move it all out to bytecode loaded at runtime even in static binaries?

> It is your decision, but please state clearly what purpose you are
> building musl for. Here it looks like you are mixing things and
> stepping in a direction I will never like.

This has all been documented all along. I'm sorry you don't understand the goals of the project. Perhaps your misunderstanding is about what "general purpose" means. It does not mean we omit anything that could offend anyone by wasting a few bytes on their hard drive. It means we don't cut corners that break important usage cases. Having a complete iconv linked whenever you link a program using iconv() does not break your usage case unless you have less than 100k of disk/ssd/rom storage to spare, and in that case, you probably shouldn't be using iconv.

If anyone ever does have a practical difficulty because of this, rather than theoretical complaints based on anglocentrism, eurocentrism, and/or xenophobia, I am not entirely opposed to making a build option to omit iconv tables, but it has to be well-motivated.

Rich

^ permalink raw reply	[flat|nested] 18+ messages in thread
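For readers keeping track of the size arithmetic in this subthread: the ISO-8859-* saving mentioned above comes from eliding every byte value that already maps identically to Unicode. A rough Python sketch (illustrative only; the table fragment and function name here are invented, not musl's actual code or data) shows the shape of such a decoder:

```python
# Decoding an 8-bit legacy charset when bytes 0x00-0x9F map directly to
# the same Unicode codepoints: only the 96 cells 0xA0-0xFF need stored
# entries, eliding 160 of 256 codepoints (62.5%) from the table.
# HIGH_TABLE below is a made-up fragment, not any real ISO-8859 mapping.

HIGH_TABLE = {0xA1: 0x00A1, 0xA4: 0x20AC}  # hypothetical entries

def decode_byte(b):
    if b < 0xA0:                       # ASCII + C1 controls: identity, no table
        return b
    return HIGH_TABLE.get(b, 0xFFFD)   # table lookup; U+FFFD if unmapped

assert decode_byte(0x41) == 0x41       # 'A' passes through untouched
assert decode_byte(0xA4) == 0x20AC     # high byte goes through the table
assert 1 - 96 / 256 == 0.625           # the 62.5% elision figure
```

The same identity-span trick generalizes: any contiguous run of bytes that equals its Unicode value costs zero table space, which is why the 8-bit charsets come in at only about 5k total.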
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 16:51 Rich Felker
  2013-08-04 22:39 ` Harald Becker
@ 2013-08-05  0:46 ` Harald Becker
  2013-08-05  5:00 ` Rich Felker
  2013-08-05  8:28 ` Roy
  3 siblings, 0 replies; 18+ messages in thread
From: Harald Becker @ 2013-08-05 0:46 UTC (permalink / raw)
Cc: musl, dalias

Hi Rich,

in addition to my previous message, to clarify some things:

04-08-2013 12:51 Rich Felker <dalias@aerifal.cx>:
> Worst-case, adding Korean and Traditional Chinese tables will
> roughly double the size of iconv.o to around 150k. This will
> noticeably enlarge libc.so, but will make no difference to
> static-linked programs except those using iconv. I'm hoping we
> can make these additions less expensive, but I don't see a good
> way yet.

I would write iconv as a virtual machine interpreter for a very simple byte code machine. The byte code (program) of the virtual machine is just an array of unsigned bytes, and the virtual machine contains only the instructions to read the next byte and assemble a Unicode value, or to receive a Unicode value and produce multi-byte character output. The virtual machine code itself works like a finite state machine to handle multi-byte character sets.

That way iconv consists of a small byte code interpreter that builds the virtual machine. It then maps in the byte code from an external file for any required character set. This byte code from the external file consists of virtual machine instructions and conversion tables.

As this virtual machine shall be optimized for conversion purposes, conversion operations require interpreting only a few virtual instructions per converted character (for simple character sets; big ones may need a few more instructions). This operation is usually very fast, as not much data is involved and the instructions are highly optimized for the conversion operation.
The virtual machine works with a data space of only a few bytes (less than 256), where some bytes need to be preserved from one conversion call to the next. That is, conversion needs a conversion context of a few bytes (8..16).

Independently of any character set conversion you want to add, you need only a single byte code interpreter for iconv, which will not increase in size. Only the external byte code / conversion tables for the charsets vary in size. Simple charsets, like the Latin variants, consist of only a few bytes of byte code; big charsets like Japanese, Chinese and Korean need some more byte code and possibly some bigger translation tables ... but those tables are only loaded if iconv needs to access such a charset.

iconv itself doesn't need to maintain a table of available charsets; it only converts the charset name into a filename and opens the corresponding charset translation file. A header and version check in the charset file shall handle possible installation conflicts.

For any conversion request, the virtual machine interpreter runs through the byte code of the requested charset and returns the conversion result. As the virtual machine shall not contain operations that can affect the rest of the system, this shall not break system security. At worst the byte code is so misbehaved that it runs forever, without producing an error or any output, so the machine just hangs in an infinite loop during conversion until the process is terminated (a simple counter may limit the number of executed instructions and bail out in case of such looping).

> At some point, especially if the cost is not reduced, I will
> probably add build-time options to exclude a configurable
> subset of the supported character encodings. This would not be
> extremely fine-grained, and the choices to exclude would
> probably be just: Japanese, Simplified Chinese, Traditional
> Chinese, and Korean.
> Legacy 8-bit might also be an option but
> these are so small I can't think of cases where it would be
> beneficial to omit them (5k for the tables on top of the 2k of
> actual code in iconv). Perhaps if there are cases where iconv
> is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices,
> dropping the 8-bit tables and all of the support code could be
> useful; the resulting iconv would be around 1k, I think.

You may skip all this if iconv is constructed as a virtual machine interpreter and all character conversions are loaded from an external file. As a fallback, the library may compile in the byte code for some small charset conversions, like ASCII, Latin-1, UTF-8. All other charset conversions are loaded from external resources, which may be installed or not depending on the admin's decision, and can be added later if required.

--
Harald

^ permalink raw reply	[flat|nested] 18+ messages in thread
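The scheme Harald describes can be sketched in a few lines. This is a toy illustration with an invented "charset file" layout (a header giving the DBCS lead/trail byte ranges, followed by a flat codepoint table), not a concrete design; it shows the stateful interpreter with a small persistent context (the pending lead byte) carried between calls:

```python
# Toy interpreter for an invented charset blob format:
#   blob = [lead_lo, lead_hi, trail_lo, trail_hi, table entries...]
# Bytes below the lead range pass through directly; a lead byte is
# stashed in the context and combined with the following trail byte
# to index the flat mapping table.

def convert(blob, data, ctx):
    lead_lo, lead_hi, trail_lo, trail_hi = blob[:4]
    ncols = trail_hi - trail_lo + 1
    table = blob[4:]
    out = []
    for b in data:
        if ctx["lead"] is not None:                  # trail byte of a pair
            idx = (ctx["lead"] - lead_lo) * ncols + (b - trail_lo)
            out.append(table[idx])
            ctx["lead"] = None
        elif lead_lo <= b <= lead_hi:                # start of a DBCS pair
            ctx["lead"] = b                          # saved in the context
        else:
            out.append(b)                            # single byte: direct map
    return out

# A fake 2x2 "charset": lead bytes A1-A2, trail bytes A1-A2.
blob = [0xA1, 0xA2, 0xA1, 0xA2, 0x4E00, 0x4E01, 0x4E02, 0x4E03]
ctx = {"lead": None}
assert convert(blob, [0x41, 0xA1, 0xA2], ctx) == [0x41, 0x4E01]
```

Because the context survives between calls, a pair split across two buffers still converts correctly, which is exactly the "few bytes of conversion context" the proposal requires. (A real design would of course be a byte-code program rather than hard-wired control flow.)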
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 16:51 Rich Felker
  2013-08-04 22:39 ` Harald Becker
  2013-08-05  0:46 ` Harald Becker
@ 2013-08-05  5:00 ` Rich Felker
  2013-08-05  8:28 ` Roy
  3 siblings, 0 replies; 18+ messages in thread
From: Rich Felker @ 2013-08-05 5:00 UTC (permalink / raw)
To: musl

On Sun, Aug 04, 2013 at 12:51:52PM -0400, Rich Felker wrote:
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
>
> Korean:
> CP949
> Lead byte range is extended to 81-FD (125)
> Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
> 44500 bytes table space
>
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
>
> The big remaining questions are:
>
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness, representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).

For what it's worth, there is no IANA charset registration for any supplement to Korean. See the table here:

http://www.iana.org/assignments/character-sets/character-sets.xhtml

The only entries for Korean are ISO-2022-KR and EUC-KR. Big5-HKSCS, however, is registered. This matches my intuition that, of the two, HKSCS would be more important to real-world usage than the Korean extensions.

If we were to omit CP949 and just go with KS X 1001, but include HKSCS, the total size (minus a minimal amount of code needed) would be 17484+37366 = 54850. With both supported, it would be 44500+37366 = 81866.
With just KS X 1001 and base Big5, it would be 17484+27946 = 45430.

Given that HKSCS is a standard, registered MIME charset, that the cost is only 10k, and that it seems necessary for real-world usage in Hong Kong, I think it's pretty obvious that we should support it. So I think the question we're left with is whether the CP949 (MS encoding) extension for Korean is important to support. The cost is roughly 37k. I'm going to keep doing research to see if identifying the characters added in it sheds any light on whether there are important additions. Obviously I would like to be able to exclude it, but I don't want this decision to be made unfairly based on my bias when it comes to bloat. :)

Rich

^ permalink raw reply	[flat|nested] 18+ messages in thread
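The table-space figures traded back and forth in this subthread all follow directly from the DBCS grid dimensions at two bytes per 16-bit mapping entry; a quick sanity check of the arithmetic:

```python
# Each DBCS cell needs one 16-bit Unicode mapping entry: size = rows * cols * 2.

ksx1001 = 93 * 94 * 2                 # KS X 1001: A1-FD x A1-FE
big5    = 89 * (63 + 94) * 2          # Big5 (CP950): A1-F9 x 40-7E,A1-FE
cp949   = 125 * (26 + 26 + 126) * 2   # CP949: 81-FD x 41-5A,61-7A,81-FE
hkscs   = 119 * (63 + 94) * 2         # HKSCS: lead 88-FE, 16-bit table only

assert ksx1001 == 17484
assert big5 == 27946
assert cp949 == 44500
assert hkscs == 37366

# The three totals being compared:
assert ksx1001 + hkscs == 54850       # KS X 1001 + HKSCS
assert cp949 + hkscs == 81866         # CP949 + HKSCS
assert ksx1001 + big5 == 45430        # KS X 1001 + base Big5
```

(The HKSCS figure covers only the 16-bit table, as noted above; the 1651 characters outside the BMP need extra mapping on top of it.)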
* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 16:51 Rich Felker
  ` (2 preceding siblings ...)
  2013-08-05  5:00 ` Rich Felker
@ 2013-08-05  8:28 ` Roy
  3 siblings, 0 replies; 18+ messages in thread
From: Roy @ 2013-08-05 8:28 UTC (permalink / raw)
To: musl

Since I'm a Traditional Chinese and Japanese legacy encoding user, I think I can say something here.

Mon, 05 Aug 2013 00:51:52 +0800, Rich Felker <dalias@aerifal.cx> wrote:

> OK, so here's what I've found so far. Both legacy Korean and legacy
> Traditional Chinese encodings have essentially a single base character
> set:
>
> Traditional Chinese:
> Big5 (CP950)
> 89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
> All characters in BMP
> 27946 bytes table space
>
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
>
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP

There is another Big5 extension called Big5-UAO, which is used on the world's largest telnet-based BBS, "ptt.cc". It has two tables: one for Big5-UAO to Unicode, the other for Unicode to Big5-UAO.

http://moztw.org/docs/big5/table/uao250-b2u.txt
http://moztw.org/docs/big5/table/uao250-u2b.txt

It extends the DBCS lead byte range down to 0x81.

> The big remaining questions are:
>
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness, representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).
As for Big5-UAO: it contains Japanese and Simplified Chinese characters which do not exist in the original MS-CP950 implementation.

> 2. Are there patterns to exploit? For Korean, ALL of the Hangul
> characters are actually combinations of several base letters. Unicode
> encodes them all sequentially in a pattern where the conversion to
> their constituent letters is purely algorithmic, but there seems to
> be no clean pattern in the legacy encodings, as the encodings started
> out just encoding the "important" ones then adding less important
> combinations in separate ranges.

In EUC-KR (MS-CP949), there are Hanja characters (i.e. Kanji characters in Japanese) and Japanese Katakana/Hiragana besides the Hangul characters.

> Worst-case, adding Korean and Traditional Chinese tables will roughly
> double the size of iconv.o to around 150k. This will noticeably
> enlarge libc.so, but will make no difference to static-linked programs
> except those using iconv. I'm hoping we can make these additions less
> expensive, but I don't see a good way yet.

For static linking, can we have conditional linking like Qt does? In a Qt static build, Q_IMPORT_PLUGIN is used to include the CJK codec tables:

#ifndef QT_SHARED
#include <QtPlugin>
Q_IMPORT_PLUGIN(qcncodecs)
Q_IMPORT_PLUGIN(qjpcodecs)
Q_IMPORT_PLUGIN(qkrcodecs)
Q_IMPORT_PLUGIN(qtwcodecs)
#endif

> At some point, especially if the cost is not reduced, I will probably
> add build-time options to exclude a configurable subset of the
> supported character encodings. This would not be extremely
> fine-grained, and the choices to exclude would probably be just:
> Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
> 8-bit might also be an option but these are so small I can't think of
> cases where it would be beneficial to omit them (5k for the tables on
> top of the 2k of actual code in iconv).
> Perhaps if there are cases
> where iconv is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices, dropping the
> 8-bit tables and all of the support code could be useful; the
> resulting iconv would be around 1k, I think.
>
> Rich

HTH,
Roy

^ permalink raw reply	[flat|nested] 18+ messages in thread
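The "purely algorithmic" Unicode Hangul layout mentioned upthread is the standard decomposition from the Unicode core specification: the precomposed syllables start at U+AC00 and are ordered as lead*588 + vowel*28 + tail, so splitting a syllable into its constituent letters is pure arithmetic. A minimal sketch (the index values are the standard Unicode jamo indices, not codepoints):

```python
# Standard Unicode Hangul syllable decomposition: syllable index
# s = cp - 0xAC00 encodes (lead, vowel, tail) in mixed radix 19 x 21 x 28.
S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28

def decompose(cp):
    s = cp - S_BASE
    assert 0 <= s < L_COUNT * V_COUNT * T_COUNT   # 11172 syllables
    lead  = s // (V_COUNT * T_COUNT)
    vowel = (s % (V_COUNT * T_COUNT)) // T_COUNT
    tail  = s % T_COUNT                           # 0 means no final consonant
    return lead, vowel, tail

assert decompose(0xAC00) == (0, 0, 0)     # U+AC00, the first syllable
assert decompose(0xD55C) == (18, 0, 4)    # U+D55C "han": lead h, vowel a, tail n
```

This is exactly why no table is needed on the Unicode side; the cost in the legacy encodings comes from their lack of any comparable pattern.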
end of thread, other threads:[~2013-08-05 14:43 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20130804232816.dc30d64f61e5ec441c34ffd4f788e58e.313eb9eea8.wbe@email22.secureserver.net>
2013-08-05 12:46 ` iconv Korean and Traditional Chinese research so far Rich Felker
2013-08-04 16:51 Rich Felker
2013-08-04 22:39 ` Harald Becker
2013-08-05  0:44 ` Szabolcs Nagy
2013-08-05  1:24 ` Harald Becker
2013-08-05  3:13 ` Szabolcs Nagy
2013-08-05  7:03 ` Harald Becker
2013-08-05 12:54 ` Rich Felker
2013-08-05  0:49 ` Rich Felker
2013-08-05  1:53 ` Harald Becker
2013-08-05  3:39 ` Rich Felker
2013-08-05  7:53 ` Harald Becker
2013-08-05  8:24 ` Justin Cormack
2013-08-05 14:43 ` Rich Felker
2013-08-05 14:35 ` Rich Felker
2013-08-05  0:46 ` Harald Becker
2013-08-05  5:00 ` Rich Felker
2013-08-05  8:28 ` Roy
Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox; see mirroring instructions for how to clone and mirror all data and code used for this inbox, as well as URLs for NNTP newsgroup(s).