* iconv Korean and Traditional Chinese research so far
@ 2013-08-04 16:51 Rich Felker
  (3 more replies) 0 siblings, 4 replies; 26+ messages in thread
From: Rich Felker @ 2013-08-04 16:51 UTC (permalink / raw)
To: musl

OK, so here's what I've found so far. Both legacy Korean and legacy
Traditional Chinese encodings have essentially a single base character
set:

Korean: KS X 1001 (previously known as KS C 5601)
  93 x 94 DBCS grid (A1-FD A1-FE)
  All characters in BMP
  17484 bytes table space

Traditional Chinese: Big5 (CP950)
  89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
  All characters in BMP
  27946 bytes table space

Both of these have various minor extensions, but the main extensions
of any relevance seem to be:

Korean: CP949
  Lead byte range is extended to 81-FD (125)
  Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
  44500 bytes table space

Traditional Chinese: HKSCS (CP951)
  Lead byte range is extended to 88-FE (119)
  1651 characters outside BMP
  37366 bytes table space for 16-bit mapping table, plus extra mapping
  needed for characters outside BMP

The big remaining questions are:

1. How important are these extensions? I would guess the answer is
"fairly important", especially for HKSCS, where I believe the
additional characters are needed for encoding Cantonese words, but
it's less clear to me whether the Korean extensions are useful (they
seem to mainly be for the sake of completeness, representing most/all
theoretically possible syllables that don't actually occur in words,
but this may be a naive misunderstanding on my part).

2. Are there patterns to exploit? For Korean, ALL of the Hangul
characters are actually combinations of several base letters.
Unicode encodes them all sequentially in a pattern where the
conversion to their constituent letters is purely algorithmic, but
there seems to be no clean pattern in the legacy encodings, as the
encodings started out just encoding the "important" ones, then added
less important combinations in separate ranges.

Worst-case, adding Korean and Traditional Chinese tables will roughly
double the size of iconv.o to around 150k. This will noticeably
enlarge libc.so, but will make no difference to static-linked programs
except those using iconv. I'm hoping we can make these additions less
expensive, but I don't see a good way yet.

At some point, especially if the cost is not reduced, I will probably
add build-time options to exclude a configurable subset of the
supported character encodings. This would not be extremely
fine-grained, and the choices to exclude would probably be just:
Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
8-bit might also be an option, but these are so small I can't think of
cases where it would be beneficial to omit them (5k for the tables on
top of the 2k of actual code in iconv). Perhaps if there are cases
where iconv is needed purely for conversion between different Unicode
forms, but no legacy charsets, on tiny embedded devices, dropping the
8-bit tables and all of the support code could be useful; the
resulting iconv would be around 1k, I think.

Rich
* Re: iconv Korean and Traditional Chinese research so far
From: Harald Becker @ 2013-08-04 22:39 UTC (permalink / raw)
Cc: musl, dalias

Hi Rich !

> Worst-case, adding Korean and Traditional Chinese tables will
> roughly double the size of iconv.o to around 150k. This will
> noticeably enlarge libc.so, but will make no difference to
> static-linked programs except those using iconv. I'm hoping we
> can make these additions less expensive, but I don't see a good
> way yet.

Oh nooo, do you really want to add this statically to the iconv
version?

Why can't we have all these character conversions in a state-driven
machine which loads its information from an external configuration
file? This way we can have any kind of conversion someone likes, by
just adding the configuration file for the required Unicode to X and
X to Unicode conversions.

State-driven FSM interpreters are really small and fast, and may read
their complete configuration from a file ... an architecture
independent file, so we may have the same character conversion files
for all architectures.

> At some point, especially if the cost is not reduced, I will
> probably add build-time options to exclude a configurable
> subset of the supported character encodings.

All this would go away if you do not load character conversions from
a static table. Why don't you consider loading a conversion file for
a given character set from a predefined or configurable directory,
with the name of the character set as the filename?
If you want the file to be in a directly readable/modifiable form,
you need to add a minimalistic parser; otherwise the file contents
may be considered binary data, and you can just fread or mmap the
file and use the data to control character set conversion. Most
conversions only need minimal space; only some require bigger
conversion routines.

... and those who dislike this just don't install the conversion
files they do not want.

--
Harald
* Re: iconv Korean and Traditional Chinese research so far
From: Szabolcs Nagy @ 2013-08-05 0:44 UTC (permalink / raw)
To: musl

* Harald Becker <ralda@gmx.de> [2013-08-05 00:39:43 +0200]:
> Why can't we have all these character conversions in a state-driven
> machine which loads its information from an external configuration
> file? This way we can have any kind of conversion someone likes, by
> just adding the configuration file for the required Unicode to X
> and X to Unicode conversions.

external files provided by libc can work but they
should be possible to embed into the binary

otherwise a static binary is not self-contained
and you have to move parts of the libc around
along with the binary and if they are loaded
from fixed path then it does not work at all
(permissions, conflicting versions etc)

if the format changes then dynamic linking is
problematic as well: you cannot update libc
in a single atomic operation
* Re: iconv Korean and Traditional Chinese research so far
From: Harald Becker @ 2013-08-05 1:24 UTC (permalink / raw)
Cc: musl, nsz

Hi !

05-08-2013 02:44 Szabolcs Nagy <nsz@port70.net>:
> * Harald Becker <ralda@gmx.de> [2013-08-05 00:39:43 +0200]:
> > Why can't we have all these character conversions in a
> > state-driven machine which loads its information from an
> > external configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
>
> external files provided by libc can work but they
> should be possible to embed into the binary

As far as I know, glibc creates small dynamically linked objects and
loads those when required. That is architecture specific, so you
always need conversion files which correspond to your C library.

My intention is to write the conversion as machine independent byte
code, which may be copied between machines of different
architectures. If you need a charset conversion, just add the charset
byte code to the conversion directory, which may be configurable
(directory name from an environment variable with a default
fallback). There may even be a search path for conversion files, so
conversion files may be installed in different locations.

> otherwise a static binary is not self-contained
> and you have to move parts of the libc around
> along with the binary and if they are loaded
> from fixed path then it does not work at all
> (permissions, conflicting versions etc)

Ok, I see the static linking topic, but this is no problem with byte
code conversion programs. It can easily be handled: just concatenate
all the conversion byte code programs into a single big array, with a
name and offset table ahead, then link it into your program.
This may be done in two steps:

1) Create a selection file for the musl build, and include the
   specified charsets in libc.a/.so

2) Select the required charset files and create an .o file to link
   into your program.

iconv then shall:

- look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
- search the table of charsets linked with libc
- search the table of charsets linked with the program
- search for the charset on an external search path

... or do it in the opposite direction and use the first charset
conversion found. This lookup is usually very small, except for the
file system search, so it shall not produce much overhead / bloat.

[Addendum after thinking a bit more: The byte code conversion files
shall consist of a small static header, followed by the byte code
program. The header shall contain the charset name, the version of
the required virtual machine, and the length of the byte code. So you
need only concatenate all such conversion files into a big array of
bytes and add a null header to mark the end of the table. Then you
only need the start of the array and you are able to search through
it for a specific charset. The iconv function in libc contains a
definition of an "unsigned char const *iconv_user_charsets = NULL;",
which is linked in when the user does not provide their own
definition. So iconv can search all linked-in charset definitions and
needs no code changes. Really simple configuration to select charsets
to build in.]

> if the format changes then dynamic linking is
> problematic as well: you cannot update libc
> in a single atomic operation

The byte code shall be independent of dynamic linking. The conversion
files are only streams of bytes, which shall also be architecture
independent. So you only need to update the conversion files if the
virtual machine definition of iconv has been changed (which shall not
happen much). External files may be read into malloc-ed buffers or
mmap-ed, not linked in by the dynamic linker.

--
Harald
* Re: iconv Korean and Traditional Chinese research so far
From: Szabolcs Nagy @ 2013-08-05 3:13 UTC (permalink / raw)
To: musl

* Harald Becker <ralda@gmx.de> [2013-08-05 03:24:52 +0200]:
> iconv then shall:
> - look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
> - search the table of charsets linked with libc
> - search the table of charsets linked with the program
> - search for the charset on an external search path

sounds like a lot of extra management cost
(for libc, application writer and user as well)

it would be nice if the compiler could figure out
at build time (eg with lto) which tables are used
but i guess charsets are often only known at runtime

> [Addendum after thinking a bit more: The byte code conversion
> files shall consist of a small static header, followed by the
> byte code program. The header shall contain the charset name,
> the version of the required virtual machine, and the length of
> the byte code. So you need only concatenate all such conversion
> files into a big array of bytes and add a null header to mark
> the end of the table. Then you only need the start of the array
> and you are able to search through it for a specific charset.
> The iconv function in libc contains a definition of an "unsigned
> char const *iconv_user_charsets = NULL;", which is linked in
> when the user does not provide their own definition. So iconv
> can search all linked-in charset definitions and needs no code
> changes. Really simple configuration to select charsets to
> build in.]

yes that can work, but it's a musl specific hack
that the application programmer needs to take care of

> > if the format changes then dynamic linking is
> > problematic as well: you cannot update libc
> > in a single atomic operation
>
> The byte code shall be independent of dynamic linking.
> The conversion files are only streams of bytes, which shall also
> be architecture independent. So you only need to update the
> conversion files if the virtual machine definition of iconv has
> been changed (which shall not happen much). External files may be
> read into malloc-ed buffers or mmap-ed, not linked in by the
> dynamic linker.

that does not solve the format change problem
you cannot update libc without race
(unless you first replace the .so which supports
the old format as well as the new one, but then
libc has to support all previous formats)

it's probably easy to design a fixed format to
avoid this

it seems somewhat similar to the timezone problem
except zoneinfo is maintained outside of libc so
there is not much choice, but there are the same
issues: updating it should be done carefully,
setuid programs must be handled specially etc
* Re: iconv Korean and Traditional Chinese research so far
From: Harald Becker @ 2013-08-05 7:03 UTC (permalink / raw)
Cc: musl, nsz

Hi !

05-08-2013 05:13 Szabolcs Nagy <nsz@port70.net>:
> * Harald Becker <ralda@gmx.de> [2013-08-05 03:24:52 +0200]:
> > iconv then shall:
> > - look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
> > - search the table of charsets linked with libc
> > - search the table of charsets linked with the program
> > - search for the charset on an external search path
>
> sounds like a lot of extra management cost
> (for libc, application writer and user as well)

This is not so much work. You already need to search for the
character set table to use; that is, you need to search at least a
table of string values to find the pointer to the conversion table.
"Searching a table" in my statement above means just walking a
pointer chain doing string compares to find a matching character
set. Not much different from the code that is required anyway. Doing
this twice to check a possible user chain is just one more helper
function call.

The only code that gets a bit bigger is the file system search. This
depends on whether we only try a single location or walk through a
search path list. But this is the cost of the flexibility to
dynamically load character set conversions (which I would really
prefer for seldom used char sets).

... and for the application writer it is only more work if he likes
to add some charset tables to his program which are not in the
static libc.

The problem is, all tables in libc need to be linked into your
program if you include iconv. So each added charset conversion
increases the size of your program ... and I definitely won't
include Japanese, Chinese or Korean charsets in my programs. Not
that I ignore those peoples' needs; I just won't need them, so I
don't like to add those conversions to programs sitting on my disk.
> it would be nice if the compiler could figure out
> at build time (eg with lto) which tables are used
> but i guess charsets are often only known at runtime

How do you want to do this? And how shall the compiler know which
char sets the user may use during operation? So the only way to
select the charset tables to include in your program is by assuming
ahead of time which tables might be used. That is part of the
configuration of the musl build or the application program build.

> > [Addendum after thinking a bit more: The byte code conversion
> > files shall consist of a small static header, followed by the
> > byte code program. The header shall contain the charset name,
> > the version of the required virtual machine, and the length of
> > the byte code. So you need only concatenate all such conversion
> > files into a big array of bytes and add a null header to mark
> > the end of the table. Then you only need the start of the array
> > and you are able to search through it for a specific charset.
> > The iconv function in libc contains a definition of an
> > "unsigned char const *iconv_user_charsets = NULL;", which is
> > linked in when the user does not provide their own definition.
> > So iconv can search all linked-in charset definitions and needs
> > no code changes. Really simple configuration to select charsets
> > to build in.]
>
> yes that can work, but it's a musl specific hack
> that the application programmer needs to take care of

Extra work has to be done only if the application programmer wants
to add a char set to the statically built program which is not in
libc. That gives some more flexibility. If you don't care, you get
the musl built-in list of char sets.

> > > if the format changes then dynamic linking is
> > > problematic as well: you cannot update libc
> > > in a single atomic operation
> >
> > The byte code shall be independent of dynamic linking. The
> > conversion files are only streams of bytes, which shall also
> > be architecture independent.
> > So you only need to update the conversion files if the virtual
> > machine definition of iconv has been changed (which shall not
> > happen much). External files may be read into malloc-ed buffers
> > or mmap-ed, not linked in by the dynamic linker.
>
> that does not solve the format change problem
> you cannot update libc without race
> (unless you first replace the .so which supports
> the old format as well as the new one, but then
> libc has to support all previous formats)

If the definition of the iconv virtual state machine is modified,
you need to take extra care on update (delete old charset files,
install new lib, install new charset files, restart system) ... but
this is only required on a major update. Once the virtual machine
definition has stabilized, you do not need to change charset
definition files; you just update your lib, then update any new
charset files. After an initial phase of testing it shall happen
relatively seldom that the virtual machine definition needs to be
changed in an incompatible manner, and simply extending the virtual
machine does not invalidate the old charset files.

> it's probably easy to design a fixed format to
> avoid this

A fixed format? For what? Do you know the differences between char
sets, especially multi byte char sets?

> it seems somewhat similar to the timezone problem
> except zoneinfo is maintained outside of libc so
> there is not much choice, but there are the same
> issues: updating it should be done carefully,
> setuid programs must be handled specially etc

Again: once the virtual machine definition has reached a stable
state, it shall not happen much that any change invalidates a
charset definition file. That is, old files will at least continue
to work with newer lib versions. So there is no problem on update;
just update your lib, then update your charset files. The only
problem will be if a still running application uses a new charset
file with an old version of the lib.
This will be detected and lead to a failure code from iconv. So you
need to restart your application ... which is always a good decision
after you have updated your lib.

--
Harald
* Re: iconv Korean and Traditional Chinese research so far
From: Rich Felker @ 2013-08-05 12:54 UTC (permalink / raw)
To: musl; +Cc: nsz

On Mon, Aug 05, 2013 at 09:03:43AM +0200, Harald Becker wrote:
> The only code that gets a bit bigger is the file system search.
> This depends on whether we only try a single location or walk
> through a search path list. But this is the cost of the
> flexibility to dynamically load character set conversions (which
> I would really prefer for seldom used char sets).

The only "seldom used char sets" are either extremely small (8bit
codepages) or simply encoding variants of an existing CJK DBCS (in
which case it's just a matter of code, not large data tables, to
support them).

> ... and for the application writer it is only more work if he
> likes to add some charset tables to his program which are not in
> the static libc.

This is only helpful if the application writer is designing around
musl. This is a practice we explicitly discourage.

> The problem is, all tables in libc need to be linked into your
> program if you include iconv. So each added charset conversion
> increases the size of your program ... and I definitely won't
> include Japanese, Chinese or Korean charsets in my programs. Not
> that I ignore those peoples' needs; I just won't need them, so I
> don't like to add those conversions to programs sitting on my
> disk.

How many programs do you intend to use iconv in that _don't_ need to
support arbitrary encodings, including ones you might not be using
yourself? Even if you don't read Korean, if a Korean user sends you
an email containing non-ASCII punctuation, Greek letters like
epsilon, etc., there's a fair chance their MUA will choose to encode
it with a legacy Korean encoding rather than UTF-8, and then you need
the conversion.
It would be nice if everybody encoded everything in UTF-8 so the
recipient was not responsible for supporting a wide range of legacy
encodings, but that's not the reality today.

> If the definition of the iconv virtual state machine is modified,
> you need to take extra care on update (delete old charset files,
> install new lib, install new charset files, restart system) ...
> but this is only required on a major update.

Even if there were really good reasons for the design you're
proposing, such a violation of the stability and atomic upgrade
policy would require a strong overriding justification. We don't
have that here.

Rich
* Re: iconv Korean and Traditional Chinese research so far
From: Rich Felker @ 2013-08-05 0:49 UTC (permalink / raw)
To: musl

On Mon, Aug 05, 2013 at 12:39:43AM +0200, Harald Becker wrote:
> Hi Rich !
>
> > Worst-case, adding Korean and Traditional Chinese tables will
> > roughly double the size of iconv.o to around 150k. This will
> > noticeably enlarge libc.so, but will make no difference to
> > static-linked programs except those using iconv. I'm hoping we
> > can make these additions less expensive, but I don't see a good
> > way yet.
>
> Oh nooo, do you really want to add this statically to the iconv
> version?

Do I want to add that size? No, of course not, and that's why I'm
hoping (but not optimistic) that there may be a way to elide a good
part of the table based on patterns in the Hangul syllables, or the
possibility that the giant extensions are unimportant.

Do I want to give users who have large volumes of legacy text in
their languages stored in these encodings the same respect and
dignity as users of other legacy encodings we already support? Yes.

> Why can't we have all these character conversions in a
> state-driven machine which loads its information from an external
> configuration file? This way we can have any kind of conversion
> someone likes, by just adding the configuration file for the
> required Unicode to X and X to Unicode conversions.
This issue was discussed a long time ago, and the consensus among
users of static linking was that static linking is most valuable when
it makes the binary completely "portable" to arbitrary Linux systems
for the same cpu arch, without any dependency on having files in
particular locations on the system aside from the minimum required by
POSIX (things like /dev/null), the standard Linux /proc mountpoint,
and universal config files like /etc/resolv.conf (even that is not
necessary, BTW, if you have a DNS on localhost).

Having iconv not work without external character tables is
essentially a form of dynamic linking, and carries with it issues
like where the files are to be found (you can override that with an
environment variable, but that can't be permitted for setuid
binaries), what happens if the format needs to change and the format
on the target machine is not compatible with the libc version your
binary was built with, etc. This is also the main reason musl does
not support something like nss.

Another side benefit of the current implementation is that it's fully
self-contained and independent of any system facilities. It's pure C
and can be taken out of musl and dropped into any program on any C
implementation, including freestanding (non-hosted) implementations.
If it depended on the filesystem, adapting it for such usage would be
a lot more work.

> State-driven FSM interpreters are really small and fast, and may
> read their complete configuration from a file ... an architecture
> independent file, so we may have the same character conversion
> files for all architectures.

An FSM implementation would be several times larger than the
implementations in iconv.c.

It's possible that we could, at some time in the future, support
loading of user-defined character conversion files as an added
feature, but this should only be for really special-purpose things
like custom encodings used for games or obsolete systems (old Mac,
console games, IBM mainframes, etc.).
In terms of the criteria for what to include in musl itself, my idea
is that if you have a mail client or web browser based on iconv for
its character set handling, you should be able to read the bulk of
content in any language.

Rich
* Re: iconv Korean and Traditional Chinese research so far
From: Harald Becker @ 2013-08-05 1:53 UTC (permalink / raw)
Cc: musl, dalias

Hi Rich !

04-08-2013 20:49 Rich Felker <dalias@aerifal.cx>:
> Do I want to add that size? No, of course not, and that's why I'm
> hoping (but not optimistic) that there may be a way to elide a
> good part of the table based on patterns in the Hangul syllables,
> or the possibility that the giant extensions are unimportant.

I think there is a way for easy configuration. See my other mails;
they clarify what my intention is.

> Do I want to give users who have large volumes of legacy text in
> their languages stored in these encodings the same respect and
> dignity as users of other legacy encodings we already support?
> Yes.

Of course. I won't dictate to others which conversions they want to
use. I only hate to have plenty of conversion tables on my system
when I know I will never use such kinds of conversions ... but in
case I really need one, it can be added dynamically to the running
system.

> > Why can't we have all these character conversions in a
> > state-driven machine which loads its information from an
> > external configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
> This issue was discussed a long time ago, and the consensus among
> users of static linking was that static linking is most valuable
> when it makes the binary completely "portable" to arbitrary Linux
> systems for the same cpu arch, without any dependency on having
> files in particular locations on the system aside from the minimum
> required by POSIX (things like /dev/null), the standard Linux
> /proc mountpoint, and universal config files like /etc/resolv.conf
> (even that is not necessary, BTW, if you have a DNS on localhost).
> Having iconv not work without external character tables is
> essentially a form of dynamic linking, and carries with it issues
> like where the files are to be found (you can override that with
> an environment variable, but that can't be permitted for setuid
> binaries), what happens if the format needs to change and the
> format on the target machine is not compatible with the libc
> version your binary was built with, etc. This is also the main
> reason musl does not support something like nss.

I see the topic of self contained linking, and you are right that it
is required, but it is fully possible to have the best of both
worlds without much overhead. Writing iconv as a virtual machine
interpreter allows the conversion byte code programs to be
statically linked in. Those that are not linked in can be searched
for in the filesystem, and a simple configuration option may disable
file system search completely, for really small embedded operation.
But beside this, all conversions are the same and may be freely
copied between architectures, or linked statically into a user
program (just put the byte stream of the selected charsets into a
simple C array of bytes).

> Another side benefit of the current implementation is that it's
> fully self-contained and independent of any system facilities.
> It's pure C and can be taken out of musl and dropped into any
> program on any C implementation, including freestanding
> (non-hosted) implementations.
> If it depended on the filesystem, adapting it for such usage would
> be a lot more work.

The virtual machine shall be written in C; I've done this type of
programming many times. So the resulting code will compile with any
C compiler, and the byte code programs are just arrays of bytes,
independent of machine byte order. So you won't have any further
dependencies.

> An FSM implementation would be several times larger than the
> implementations in iconv.c.

A bit larger, yes ... but not so much if the virtual machine is
designed carefully, and it will not increase in size when more
charsets are added (only the size of the byte code programs is
added).

> It's possible that we could, at some time in the future, support
> loading of user-defined character conversion files as an added
> feature, but this should only be for really special-purpose things
> like custom encodings used for games or obsolete systems (old Mac,
> console games, IBM mainframes, etc.).

We can have it all, with not much overhead, and it is not only for
such special cases. I don't like to install musl on my systems with
Japanese, Chinese or Korean conversions, but in case I really need
them, I'm able to throw them in without much work ... and we can add
every character conversion on the fly, without a rebuild of the
library.

> In terms of the criteria for what to include in musl itself, my
> idea is that if you have a mail client or web browser based on
> iconv for its character set handling, you should be able to read
> the bulk of content in any language.

But what if you are building a mail client or web browser and want
to include the possibility of charset conversion while staying
small: including only the system relevant conversions, but not
limited to those? Any other conversion can then be added on the fly.

--
Harald
* Re: iconv Korean and Traditional Chinese research so far
From: Rich Felker @ 2013-08-05 3:39 UTC (permalink / raw)
To: musl

On Mon, Aug 05, 2013 at 03:53:12AM +0200, Harald Becker wrote:
> Hi Rich !
>
> 04-08-2013 20:49 Rich Felker <dalias@aerifal.cx>:
>
> > Do I want to add that size? No, of course not, and that's why
> > I'm hoping (but not optimistic) that there may be a way to
> > elide a good part of the table based on patterns in the Hangul
> > syllables, or the possibility that the giant extensions are
> > unimportant.
>
> I think there is a way for easy configuration. See my other mails;
> they clarify what my intention is.

I saw, and you're free to write such an iconv implementation if you
like, but it's not right for musl. Inventing elaborate mechanisms to
solve simple problems is the glibc way of doing things, not the musl
way. iconv is not something that needs to be extensible. There is a
finite set of legacy encodings that's relevant to the world, and
their relevance is going to go down and down with time, not up.

> > Do I want to give users who have large volumes of legacy text
> > in their languages stored in these encodings the same respect
> > and dignity as users of other legacy encodings we already
> > support? Yes.
>
> Of course. I won't dictate to others which conversions they want
> to use. I only hate to have plenty of conversion tables on my
> system when I know I will never use such kinds of conversions.

And your table for just Chinese is as large as all our tables
combined... I agree you can make iconv smaller than musl's in the
case where _no_ legacy DBCS are installed. But if you have just one,
you'll be just as large or larger than musl with them all. Just
compare the size of musl's tables to glibc's converters.
I've worked hard to make them as small as reasonably possible without doing hideous hacks like decompression into an in-memory buffer, which would actually increase bloat. > ... but > in case I really need, it can be added dynamically to the running > system. If you have root or want to set up nonstandard environment variables. > > This issue was discussed a long time ago and the consensus > > among users of static linking was that static linking is most > > valuable when it makes the binary completely "portable" to > > arbitrary Linux systems for the same cpu arch, without any > > dependency on having files in particular locations on the > > system aside from the minimum required by POSIX (things > > like /dev/null), the standard Linux /proc mountpoint, and > > universal config files like /etc/resolv.conf (even that is not > > necessary, BTW, if you have a DNS on localhost). Having iconv > > not work without external character tables is essentially a > > form of dynamic linking, and carries with it issues like where > > the files are to be found (you can override that with an > > environment variable, but that can't be permitted for setuid > > binaries), what happens if the format needs to change and the > > format on the target machine is not compatible with the libc > > version your binary was built with, etc. This is also the main > > reason musl does not support something like nss. > > I see the topic of self-contained linking, and you are right that > it is required, but it is fully possible to have the best of both > worlds without much overhead. Writing iconv as a virtual machine It's not the best of both worlds. It's essentially the same as dynamic linking. > interpreter allows statically linking in the conversion byte > code programs. At several times the size of the current code/tables, and after the user searches through the documentation to figure out how to do it.
> > Another side benefit of the current implementation is that it's > > fully self-contained and independent of any system facilities. > > It's pure C and can be taken out of musl and dropped in to any > > program on any C implementation, including freestanding > > (non-hosted) implementations. If it depended on the filesystem, > > adapting it for such usage would be a lot more work. > > The virtual machine shall be written in C, I've done such type of > programming many times. So resulting code will compile with any C > compiler, and byte code programs are just array of bytes, > independent of machine byte order. So you will have any further > dependencies. It's not just a matter of dropping in. You'd have path searches to modify or disable, build options to get the static tables turned on, and all of this stuff would have to be integrated with the build system for what you're dropping it into. Complexity is never the solution. Honestly, I would take a 1mb increase in binary size over this kind of complexity any day. Thankfully, we don't have to make such a tradeoff. > > A fsm implementation would be several times larger than the > > implementations in iconv.c. > > A bit larger, yes ... but not so much, if virtual machine gets > designed carefully, and it will not increase in size, when there > are more charsets get added (only size of byte code program > added). Charsets are not added. The time of charsets is over. It should have been over in 1992, when Pike and Thompson made them obsolete, but it's really over now. > > It's possible that we could, at some time in the future, > > support loading of user-defined character conversion files as > > an added feature, but this should only be for really > > special-purpose things like custom encodings used for games or > > obsolete systems (old Mac, console games, IBM mainframes, etc.). > > We can have it all, with not much overhead. And it is not only > for such special cases. 
I don't like to install musl on my > systems with Japanese, Chinese or Korean conversions, but in case > I really need, I'm able to throw them in, without much work. > > .... and we can add every character conversion on the fly, without > rebuild of the library. Maybe we should also include a bytecode interpreter for doing hostname lookups, since you might want to do something other than DNS or a hosts file. And a bytecode interpreter for user database lookups in place of passwd files. And a bytecode interpreter for adding new crypt() algorithms. And... > > In terms of the criteria for what to include in musl itself, my > > idea is that if you have a mail client or web browser based on > > iconv for its character set handling, you should be able to > > read the bulk of content in any language. > > If you are building a mail client or web browser, but what if you > want to include the possibility of charset conversion but stay at > small size, just including conversions for only system relevant > conversions, but not limiting to those. Any other conversion can > then be added on the fly. Then dynamic link it. If you want an extensible binary, you use dynamic linking. The main reason for static linking is when you want a binary whose behavior does not change with the runtime environment -- for example, for security purposes, for carrying around to other machines that don't have the same runtime environment, etc. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 3:39 ` Rich Felker @ 2013-08-05 7:53 ` Harald Becker 2013-08-05 8:24 ` Justin Cormack 2013-08-05 14:35 ` Rich Felker 0 siblings, 2 replies; 26+ messages in thread From: Harald Becker @ 2013-08-05 7:53 UTC (permalink / raw) Cc: musl, dalias Hi Rich ! > iconv is not something that needs to be extensible. There is a > finite set of legacy encodings that's relevant to the world, > and their relevance is going to go down and down with time, not > up. Oh! So you consider Japanese, Chinese, Korean, etc. languages relevant for programs sitting on my machines? How can you decide this? Why be so ignorant, trying to write a standards-conforming library and then picking out a list of charsets of your choice to support in iconv, neglecting the wishes and needs of musl users? ... or in other words, if you really are this ignorant and insist on including those charsets fixed in musl, musl is no longer for me :( ... I don't need to bring any part of mine into musl, but I don't consider a lib usable for my needs which includes several charset files in a static build and refuses to load seldom-used charset definitions externally in any way. > > > > Do I want to give users who have large volumes of legacy > > > text in their languages stored in these encodings the same > > > respect and dignity as users of other legacy encodings we > > > already support? Yes. > > > > Of course. I won't dictate others which conversions they want > > to use. I only hat to have plenty of conversion tables on my > > system when I really know I never use such kind of > > conversions. > > And your table for just Chinese is as large as all our tables > combined... How can you tell? I don't think so. Such conversion codes may be very compact.
Size is mainly required for translation tables, that is, when the code points of the charsets do not match Unicode character order; but you always need the space for those translations. The rest won't be much. > I agree you can make iconv smaller than musl's in the case > where _no_ legacy DBCS are installed. But if you have just one, > you'll be just as large or larger than musl with them all. ... musl with them all? I don't consider them smaller than an optimized byte code interpreter ... not when you are going to include DBCS charsets fixed into musl. At least if you do all the required translations. > compare the size of musl's tables to glibc's converters. I've > worked hard to make them as small as reasonably possible > without doing hideous hacks like decompression into an > in-memory buffer, which would actually increase bloat. Are you now going to build a lib for startup purposes and embedded systems only, or are you trying to write a general-purpose library? Including all those definitions in a static build is definitely not the way I will ever like. This may be done for some special situations and selected charsets, but not for a general-purpose library claiming to get wide usage. > If you have root or want to setup nonstandard environment > variables. What about a charset search path including something like "~/.local/share/charset"? This would allow installing charset files in the user's directory. > > interpreter allows statically linking in the conversion byte > > code programs. > > At several times the size of the current code/tables, and after > the user searches through the documentation to figure out how > to do it. You really intend to include all those code tables statically into musl? I won't include much more than some standard sets. Why don't you want to load the charset definitions as they are required?
On one hand you say "use dietlibc" if you need small static programs, and on the other hand you want to include many charset definitions in a static build to avoid dynamic loading of tables, required only on embedded systems. So what's the purpose of musl? I don't think you are being consistent here. > It's not just a matter of dropping in. You'd have path searches > to modify or disable, build options to get the static tables > turned on, and all of this stuff would have to be integrated > with the build system for what you're dropping it into. I don't see the required complexity. In fact I won't have a lib that includes several charset definitions in a static build. I really like to have a directory with definition files for those charsets, and I don't see the complexity for this that you proclaim. Inclusion in a static build is no more than selection of the charsets you want to be included statically. This selection is always required, or you include all files, which I definitely reject. > Complexity is never the solution. Honestly, I would take a 1mb > increase in binary size over this kind of complexity any day. > Thankfully, we don't have to make such a tradeoff. The only complexity which we have here is the complexity of charset translation. The rest is relatively simple. > Charsets are not added. The time of charsets is over. It should > have been over in 1992, when Pike and Thompson made them > obsolete, but it's really over now. So why are you adding Japanese, Chinese and Korean charsets to an iconv conversion in musl? Why not just use UTF-8? Whenever you use iconv you want the flexibility to do all required charset conversions. Which means you need to statically link in many charset definitions, or you need to dynamically load what is required. > Then dynamic link it. If you want an extensible binary, you use > dynamic linking. Dynamic linking of a mail client, ok, and where do the charset definition files go? Are they all packed into your libc.so?
That is a very big file! Why do I need to have Asian language definitions on my disk when I do not want them? It is your decision, but please state clearly for what purpose you are building musl. Here it looks like you are mixing things and stepping in a direction I will never like. -- Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 7:53 ` Harald Becker @ 2013-08-05 8:24 ` Justin Cormack 2013-08-05 14:43 ` Rich Felker 2013-08-05 14:35 ` Rich Felker 1 sibling, 1 reply; 26+ messages in thread From: Justin Cormack @ 2013-08-05 8:24 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 6597 bytes --] On 5 Aug 2013 08:53, "Harald Becker" <ralda@gmx.de> wrote: > > Hi Rich ! > > > iconv is not something that needs to be extensible. There is a > > finite set of legacy encodings that's relevant to the world, > > and their relevance is going to go down and down with time, not > > up. > > Oh! So you consider Japanese, Chinese, Korean, etc. languages > relevant for programs sitting on my machines? How can you decide > this? Why being so ignorant and trying to write an standard > conform library and then pick out a list of char sets of your > choice which may be possible on iconv, neglecting wishes and > need of any musl user. > > ... or in other words, if you really be this ignorant and > insist on including those charsets fixed in musl, musl is never > more for me :( ... I don't need to bring in any part of mine into > musl, but I don't consider a lib usable for my needs, which > include several char set files in statical build and neglects to > load seldom used charset definitions from extern in any way. They are not going to be "fixed" just don't build them. It is not hard with Musl. Just add this into your build script. One of the nice features of Musl is that it appeals to a broader audience than just "embedded" so it is always going to have stuff you can cut out if you want absolute minimalism but this means it will get wider usage. Adding external files has many disadvantages to other people. If you don't want these conversions external files do not help you. Making software for more than one person involves compromises so please calm down a bit. 
Use your own embedded build with the parts you don't need omitted. Justin > > > > > > Do I want to give users who have large volumes of legacy > > > > text in their languages stored in these encodings the same > > > > respect and dignity as users of other legacy encodings we > > > > already support? Yes. > > > > > > Of course. I won't dictate others which conversions they want > > > to use. I only hat to have plenty of conversion tables on my > > > system when I really know I never use such kind of > > > conversions. > > > > And your table for just Chinese is as large as all our tables > > combined... > > How can you tell this. I don't think so. Such conversion codes > may be very compact. Size is mainly required for translation > tables, that is when code points of the char sets does not match > Unicode character order, but you always need the space for those > translations. The rest won't be much. > > > I agree you can make iconv smaller than musl's in the case > > where _no_ legacy DBCS are installed. But if you have just one, > > you'll be just as large or larger than musl with them all. > > ... musl with them all? I don't consider them smaller than an > optimized byte code interpreter ... not when you are going to > include DBCS char sets fixed into musl. At least if you do all > the required translations. > > > compare the size of musl's tables to glibc's converters. I've > > worked hard to make them as small as reasonably possible > > without doing hideous hacks like decompression into an > > in-memory buffer, which would actually increase bloat. > > Are you now going to build a lib for startup purpose and embedded > systems only or are you trying to write a general purpose > library? Including all those definitions in a statical build is > definitely not the way I will ever like. This may be done for > some special situations and selected char sets, but not for a > general purpose library, claiming to get a wide usage. 
> > > If you have root or want to setup nonstandard environment > > variables. > > What about a charset searchpath including something like > "~/.local/share/charset". This would allow to install charset > files in the users directory. > > > > interpreter allows to statical link in the conversion byte > > > code programs. > > > > At several times the size of the current code/tables, and after > > the user searches through the documentation to figure out how > > to do it. > > You definitely consider to include all those code tables > statically into musl? I won't include much more than some > standard sets. Why don't you want to load the charset definitions > as they are required? > > On one hand you say "use dietlibc" if you need small statical > programs and on the other hand you want to include many charset > definitions into a statical build to avoid dynamic loading of > tables, required only on embedded systems. > > So what's the purpose of musl? I don't think you stay right here. > > > It's not just a matter of dropping in. You'd have path searches > > to modify or disable, build options to get the static tables > > turned on, and all of this stuff would have to be integrated > > with the build system for what you're dropping it into. > > I don't see the required complexity. In fact I won't have a lib > that includes several charset definitions in a statical build. I > really like to have a directory with definition files for those > char sets and don't see the complexity for this you proclamate. > > Inclusion in statical build is not more than selection of the > charsets you want o be included statically. This selection is > always required or you include all files , which I definitly > neglect. > > > Complexity is never the solution. Honestly, I would take a 1mb > > increase in binary size over this kind of complexity any day. > > Thankfully, we don't have to make such a tradeoff. 
> > The only complexity which we has here is the complexity of > charset translation. The rest is relatively simple. > > > Charsets are not added. The time of charsets is over. It should > > have been over in 1992, when Pike and Thompson made them > > obsolete, but it's really over now. > > So why are you adding Japanese, Chinese and Korean charsets to an > iconv conversion in musl? Why not just using UTF-8? Whenever you > use iconv you want the flexibility to do all required charset > conversions. Which means you need to statically link in many > charset definitions or you need to dynamically load what is > required. > > > Then dynamic link it. If you want an extensible binary, you use > > dynamic linking. > > Dynamic linking of mail client, ok and where go the charset > definition files? Are they all packed into your libc.so? That is > a very big file? Why do I need to have Asian language definition > on my disk, when I do not want? > > It is your decision, but please state clear what purpose you are > building musl. Here it looks you are mixing things and steping in > a direction I will never like. > > -- > Rich [-- Attachment #2: Type: text/html, Size: 8041 bytes --] ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 8:24 ` Justin Cormack @ 2013-08-05 14:43 ` Rich Felker 0 siblings, 0 replies; 26+ messages in thread From: Rich Felker @ 2013-08-05 14:43 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 09:24:37AM +0100, Justin Cormack wrote: > They are not going to be "fixed" just don't build them. It is not hard with > Musl. Just add this into your build script. Indeed. My intent is for it to be fully-functional-as-shipped. If somebody needs to cripple certain interfaces to meet extreme size requirements, that's an ok local modification, and it might even be acceptable as a configure option if enough people legitimately request it. > One of the nice features of Musl is that it appeals to a broader audience > than just "embedded" so it is always going to have stuff you can cut out if > you want absolute minimalism but this means it will get wider usage. Cutting out math/*, complex/*, and most of crypt/* would save at least as much space as iconv, and there are plenty of places these aren't needed either. It's not for me to decide which options you can omit. Thankfully, due to musl's correct handling of static linking, you usually don't have to think about it either. You just static link and get only what you need. > Adding external files has many disadvantages to other people. If you don't > want these conversions external files do not help you. External files also do not make things work "by default". They only work if musl has been installed system-wide according to our directions (which not everybody will follow) or if the user has done the research to figure out how to work around it not being installed system-wide. > Making software for more than one person involves compromises so please > calm down a bit. Use your own embedded build with the parts you don't need > omitted. Exactly. Where musl excels here is by not _forcing_ you to use iconv.
I take great care not to force linking of components you might not want to see in your output binary size, and for TLS, which unfortunately was misdesigned in such a way that the linker can't see if TLS is used or not for the purpose of deciding whether to link the TLS init code, I went to great lengths both to minimize the size of __init_tls.o and to make it easy, as a local customization, to omit this module. But as an analogy, I would not have even considered asking musl users who need TLS to add special CFLAGS, libraries, etc. when building programs. That's an unreasonable burden and it's broken because it does not "work by default". Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 7:53 ` Harald Becker 2013-08-05 8:24 ` Justin Cormack @ 2013-08-05 14:35 ` Rich Felker 1 sibling, 0 replies; 26+ messages in thread From: Rich Felker @ 2013-08-05 14:35 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 09:53:32AM +0200, Harald Becker wrote: > Hi Rich ! > > > iconv is not something that needs to be extensible. There is a > > finite set of legacy encodings that's relevant to the world, > > and their relevance is going to go down and down with time, not > > up. > > Oh! So you consider Japanese, Chinese, Korean, etc. languages > relevant for programs sitting on my machines? How can you decide I don't decide what's relevant for you. Rather, I don't have the authority to declare it irrelevant-by-default. This is true even for things like crypt algorithms (does anybody really want to use md5??) but especially for anything that would preclude somebody from being able to receive data in their native language. Simple multilingual support via UTF-8 with conversion from legacy data has been near top priority, if not top, since the conception of musl. If history has shown us anything, it's that universal support for all languages must be default and turning off some support to save space (which is rarely if ever actually needed) needs to be a conscious decision. I'm no Apple fan by any means, but just look at the situation on iOS: you can turn on a new iPhone or iPad and read data in any language (including having the relevant fonts!) and even add a keyboard and type in almost any language, without having to buy a special localized version or install add-ons. This is very different from the situation on Android right now. musl's intended applicability is broad. 
From industrial control to set-top boxes, in-car entertainment, initramfs images for desktop machines, phones, tablets, plug computers that run your private home or office webmail server, full desktops, VE LAMP stacks, hosts for VEs, etc. Some of these usages have a real need for human-language text; others don't. But if we have the power to make it such that, if someone uses musl to implement a plug computer for webmail, it naturally supports all languages unless the maker of the device goes and actively rips that support out, then we have a responsibility to do so. Or, said differently, it's OUR FAULT for making broken-by-default software if language support is missing unless you go to the effort of learning musl-specific ways to enable it. > this? Why be so ignorant, trying to write a standards-conforming > library and then picking out a list of charsets of your choice > to support in iconv, neglecting the wishes and needs of musl > users? If I were to just accept your demands, it would essentially mean: (1) discarding the opinions of everybody else who discussed this issue in the past and decided that static linking should mean real static binaries that work the same without needing extra files in the filesystem. (2) discarding the informed decisions I made based on said discussions. > ... or in other words, if you really are this ignorant and > insist on including those charsets fixed in musl, musl is no > longer for me :( ... I don't need to bring any part of mine into > musl, but I don't consider a lib usable for my needs which > includes several charset files in a static build and refuses to > load seldom-used charset definitions externally in any way. Name the extra "seldom used charset definitions" you're interested in. They're probably already supported. We are not discussing adding some new giant subsystem to musl.
We are discussing adding the last two missing major legacy charsets to an existing framework that's existed for a long time. > > > > Do I want to give users who have large volumes of legacy > > > > text in their languages stored in these encodings the same > > > > respect and dignity as users of other legacy encodings we > > > > already support? Yes. > > > > > > Of course. I won't dictate others which conversions they want > > > to use. I only hat to have plenty of conversion tables on my > > > system when I really know I never use such kind of > > > conversions. > > > > And your table for just Chinese is as large as all our tables > > combined... > > How can you tell this. I don't think so. You're welcome to implement it and see. Thanks to the way static linking works, if you add -lyouriconv when static linking, the iconv in musl will be completely omitted from the binary and yours will be used instead. Of course the iconv in musl will be completely omitted anyway except in the small number of programs that actually use iconv. This is not glibc where stdio and locale depend on iconv. iconv is purely iconv. > Such conversion codes > may be very compact. Size is mainly required for translation > tables, that is when code points of the char sets does not match > Unicode character order, but you always need the space for those > translations. The rest won't be much. That's all the size. The VAST majority of the table size is for 4 major character encoding families, those based on: - JIS 0208 - GB 18030 - KS X 1001 - Big5 As for legacy 8-bit encodings, musl's approach to them is also more efficient than you could easily be with a state machine. The fact that the number of codepoints that ever appear in an 8-bit encoding is less than 1024 is used to store the mappings as 10-bit-per-entry packed arrays of indices into the legacy_chars table. This reduces the marginal cost of individual 8bit encodings by 25% (versus 16-bit entries). 
The ASCII range and any span upward into the high range that maps directly to Unicode codepoints is also elided from the table (which reduces ISO-8859-* by another 62.5%). In short, what we have is about the smallest possible representation you can get without applying LZMA or something (and thereby needing all the code to decompress and dirty pages to store the decompressed version). It's hard to beat. By the way, if you really want to save the space they take, you could just delete this email thread from your mail folder. It's larger than musl's iconv already. :-) > > I agree you can make iconv smaller than musl's in the case > > where _no_ legacy DBCS are installed. But if you have just one, > > you'll be just as large or larger than musl with them all. > > .... musl with them all? I don't consider them smaller than an > optimized byte code interpreter ... not when you are going to > include DBCS char sets fixed into musl. At least if you do all > the required translations. I may have been exaggerating a little bit, but I doubt you can get your bytecode GB18030 support smaller than about 110k once you count the bytecode and the interpreter binary. I'm even more doubtful that you can get it smaller than the current 71k in musl. > > compare the size of musl's tables to glibc's converters. I've > > worked hard to make them as small as reasonably possible > > without doing hideous hacks like decompression into an > > in-memory buffer, which would actually increase bloat. > > Are you now going to build a lib for startup purpose and embedded > systems only or are you trying to write a general purpose > library? General-purpose. Have you not read the website? Originally in the 1990s, Linux-based systems used a fork of the GNU C library (glibc) version 1, which existed in various versions (libc4, libc5). Later, distributions adopted the more mature version 2 of glibc, and denoted it libc6. 
Since then, other specialized C library implementations such as uClibc and dietlibc have emerged as well. musl is a new general-purpose implementation of the C library. It is lightweight, fast, simple, free, and aims to be correct in the sense of standards-conformance and safety. If you're using it for startup purposes or embedded systems that don't communicate with humans in human language, you won't be running applications that call iconv() and thus it's irrelevant. > On one hand you say "use dietlibc" if you need small statical > programs and on the other hand you want to include many charset > definitions into a statical build to avoid dynamic loading of > tables, required only on embedded systems. Where did I say "use dietlibc"? If I did (I don't really remember) it was not a serious recommendation but a sarcastic remark to make a point that musl is not about being "smallest-at-all-costs" (and thereby broken) like dietlibc is. > > have been over in 1992, when Pike and Thompson made them > > obsolete, but it's really over now. > > So why are you adding Japanese, Chinese and Korean charsets to an > iconv conversion in musl? Why not just using UTF-8? Whenever you > use iconv you want the flexibility to do all required charset > conversions. Which means you need to statically link in many > charset definitions or you need to dynamically load what is > required. The time of creating charsets is over. That does not magically make the data created in those charsets in the past go away or convert itself to UTF-8. It doesn't even magically stop people from making new data in those charsets. All it means is that governments, vendors, etc. have stopped the madness of making new charsets. > > Then dynamic link it. If you want an extensible binary, you use > > dynamic linking. > > Dynamic linking of mail client, ok and where go the charset > definition files? Are they all packed into your libc.so? That is > a very big file? 
Why do I need to have Asian language definitions > on my disk when I do not want them? Because any other solution would be larger, would defeat the purpose of static linking, and would contribute to the problem of poor multilingual support. Why are you upset about these tables and not other tables like crypto sboxes, wcwidth, character classes, bits of 2/pi and pi/2, etc.? By the way, math/*.o are also fairly large, on the same order of magnitude as iconv; would you also suggest we move it all out to bytecode loaded at runtime even in static binaries? > It is your decision, but please state clearly for what purpose > you are building musl. Here it looks like you are mixing things > and stepping in a direction I will never like. This has all been documented all along. I'm sorry you don't understand the goals of the project. Perhaps your misunderstanding is about what "general purpose" means. It does not mean we omit anything that could offend anyone by wasting a few bytes on their hard drive. It means we don't cut corners that break important usage cases. Having a complete iconv linked whenever you link a program using iconv() does not break your usage case unless you have less than 100k of disk/ssd/rom storage to spare, and in that case, you probably shouldn't be using iconv. If anyone ever does have a practical difficulty because of this, rather than theoretical complaints based on anglocentrism, eurocentrism, and/or xenophobia, I am not entirely opposed to making a build option to omit iconv tables, but it has to be well-motivated. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-04 16:51 iconv Korean and Traditional Chinese research so far Rich Felker 2013-08-04 22:39 ` Harald Becker @ 2013-08-05 0:46 ` Harald Becker 2013-08-05 5:00 ` Rich Felker 2013-08-05 8:28 ` Roy 3 siblings, 0 replies; 26+ messages in thread From: Harald Becker @ 2013-08-05 0:46 UTC (permalink / raw) Cc: musl, dalias Hi Rich, in addition to my previous message, to clarify some things: 04-08-2013 12:51 Rich Felker <dalias@aerifal.cx>: > Worst-case, adding Korean and Traditional Chinese tables will > roughly double the size of iconv.o to around 150k. This will > noticeably enlarge libc.so, but will make no difference to > static-linked programs except those using iconv. I'm hoping we > can make these additions less expensive, but I don't see a good > way yet. I would write iconv as a virtual machine interpreter for a very simple byte code machine. The byte code (program) of the virtual machine is just an array of unsigned bytes, and the virtual machine contains only the instructions to read the next byte and assemble a Unicode value, or to receive a Unicode value and produce multi-byte character output. The virtual machine code itself works like a finite state machine to handle multi-byte character sets. That way iconv consists of a small byte code interpreter that builds the virtual machine. It then maps in the byte code from an external file for any required character set. This byte code from the external file consists of virtual machine instructions and conversion tables. As this virtual machine shall be optimized for conversion purposes, conversion requires interpretation of only a few virtual instructions per converted character (for simple character sets; big ones may need a few more instructions). This operation is usually very fast, as not much data is involved and the instructions are highly optimized for the conversion operation.
The virtual machine works with a data space of only a few bytes (less than 256), where some bytes need to be preserved from one conversion call to the next. That is, conversion needs a conversion context of a few bytes (8..16). Independently of any character set conversion you want to add, you only need a single byte code interpreter for iconv, which will not increase in size. Only the external byte code / conversion table for the charsets may vary in size. Simple charsets, like the Latin ones, consist of only a few bytes of byte code; big charsets like Japanese, Chinese and Korean need some more byte code and maybe some bigger translation tables ... but those tables are only loaded if iconv needs to access such a charset. iconv itself doesn't need to handle a table of available charsets; it only converts the charset name into a filename and opens the corresponding charset translation file. In the charset file, a header and version check shall handle possible installation conflicts. For any conversion request the virtual machine interpreter runs through the byte code of the requested charset and returns the conversion result. As the virtual machine shall not contain operations that can violate the remainder of the system, this shall not break system security. At worst the byte code is so misbehaved that it runs forever, without producing an error or any output, so the machine just hangs in an infinite loop during conversion until the process is terminated (a simple counter may limit the number of executed instructions and bail out in case of such looping). > At some point, especially if the cost is not reduced, I will > probably add build-time options to exclude a configurable > subset of the supported character encodings. This would not be > extremely fine-grained, and the choices to exclude would > probably be just: Japanese, Simplified Chinese, Traditional > Chinese, and Korean.
Legacy 8-bit might also be an option but > these are so small I can't think of cases where it would be > beneficial to omit them (5k for the tables on top of the 2k of > actual code in iconv). Perhaps if there are cases where iconv > is needed purely for conversion between different Unicode > forms, but no legacy charsets, on tiny embedded devices, > dropping the 8-bit tables and all of the support code could be > useful; the resulting iconv would be around 1k, I think. You may skip all this if iconv is constructed as a virtual machine interpreter and all character conversions are loaded from an external file. As a fallback, the library may compile in the byte code for some small charset conversions, like ASCII, Latin-1, UTF-8. All other charset conversions are loaded from external resources, which may be installed or not depending on the admin's decision, and can simply be added later if required. -- Harald ^ permalink raw reply [flat|nested] 26+ messages in thread
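The interpreter scheme described in this message could be sketched roughly as below. This is a toy illustration only: the opcodes, table layout, and function names are all invented here (the proposal never specifies them), and a real design would need stateful multi-byte handling and the instruction-count limit mentioned above.

```python
# Toy sketch of a bytecode-driven charset converter.  The "program"
# is a list of opcodes; the VM walks it once per input byte.
OP_ASCII = 0  # emit bytes < 0x80 unchanged
OP_TABLE = 1  # map bytes >= 0x80 through a 128-entry codepoint table

def vm_to_unicode(bytecode, table, data):
    """Run the conversion 'program' over a byte string, returning str."""
    out = []
    for b in data:
        for op in bytecode:
            if op == OP_ASCII and b < 0x80:
                out.append(chr(b))
                break
            if op == OP_TABLE and b >= 0x80:
                out.append(chr(table[b - 0x80]))
                break
    return "".join(out)

# For an ISO-8859-1-style charset the table is just the identity mapping;
# real charsets would ship a nontrivial table in the external file.
latin1_table = list(range(0x80, 0x100))
```

With this structure, only `vm_to_unicode` lives in libc; the opcode list and table would come from the external per-charset file, which is the point of the proposal.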
* Re: iconv Korean and Traditional Chinese research so far 2013-08-04 16:51 iconv Korean and Traditional Chinese research so far Rich Felker 2013-08-04 22:39 ` Harald Becker 2013-08-05 0:46 ` Harald Becker @ 2013-08-05 5:00 ` Rich Felker 2013-08-05 8:28 ` Roy 3 siblings, 0 replies; 26+ messages in thread From: Rich Felker @ 2013-08-05 5:00 UTC (permalink / raw) To: musl On Sun, Aug 04, 2013 at 12:51:52PM -0400, Rich Felker wrote: > Both of these have various minor extensions, but the main extensions > of any relevance seem to be: > > Korean: > CP949 > Lead byte range is extended to 81-FD (125) > Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126) > 44500 bytes table space > > Traditional Chinese: > HKSCS (CP951) > Lead byte range is extended to 88-FE (119) > 1651 characters outside BMP > 37366 bytes table space for 16-bit mapping table, plus extra mapping > needed for characters outside BMP > > The big remaining questions are: > > 1. How important are these extensions? I would guess the answer is > "fairly important", especially for HKSCS where I believe the > additional characters are needed for encoding Cantonese words, but > it's less clear to me whether the Korean extensions are useful (they > seem to mainly be for the sake of completeness representing most/all > possible theoretical syllables that don't actually occur in words, but > this may be a naive misunderstanding on my part). For what it's worth, there is no IANA charset registration for any supplement to Korean. See the table here: http://www.iana.org/assignments/character-sets/character-sets.xhtml The only entries for Korean are ISO-2022-KR and EUC-KR. Big5-HKSCS however is registered. This matches my intuition that, of the two, HKSCS would be more important to real-world usage than Korean extensions. If we were to omit CP949 and just go with KS X 1001, but include HKSCS, the total size (minus a minimal amount of code needed) would be 17484+37366 = 54850.
With both supported, it would be 44500+37366 = 81866. With just KS X 1001 and base Big5, it would be 17484+27946 = 45430. Given that HKSCS is a standard, registered MIME charset, that the cost is only about 10k, and that it seems necessary for real-world usage in Hong Kong, I think it's pretty obvious that we should support it. So I think the question we're left with is whether the CP949 (MS encoding) extension for Korean is important to support. The cost is roughly 27k (44500-17484). I'm going to keep doing research to see if identifying the characters added in it sheds any light on whether there are important additions. Obviously I would like to be able to exclude it but I don't want this decision to be made unfairly based on my bias when it comes to bloat. :) Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
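The table-space figures traded back and forth in this thread all follow from one rule: each DBCS grid cell stores one 16-bit Unicode value, so table space is lead bytes times tail bytes times two. A quick check, using the grid dimensions quoted in the messages above:

```python
# table space = lead_byte_count * tail_byte_count * 2 bytes per entry.
# All dimensions are taken directly from the figures in the thread.
def table_bytes(lead, tail):
    return lead * tail * 2

ksx1001 = table_bytes(93, 94)              # KS X 1001: A1-FD x A1-FE
big5    = table_bytes(89, 63 + 94)         # Big5: A1-F9 x 40-7E,A1-FE
cp949   = table_bytes(125, 26 + 26 + 126)  # CP949: 81-FD x 41-5A,61-7A,81-FE
hkscs   = table_bytes(119, 63 + 94)        # HKSCS: lead extended to 88-FE
```

Running this reproduces the 17484, 27946, 44500, and 37366 figures exactly, and the sums (54850, 81866, 45430) follow.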
* Re: iconv Korean and Traditional Chinese research so far 2013-08-04 16:51 iconv Korean and Traditional Chinese research so far Rich Felker ` (2 preceding siblings ...) 2013-08-05 5:00 ` Rich Felker @ 2013-08-05 8:28 ` Roy 2013-08-05 15:43 ` Rich Felker 2013-08-05 19:12 ` Rich Felker 3 siblings, 2 replies; 26+ messages in thread From: Roy @ 2013-08-05 8:28 UTC (permalink / raw) To: musl Since I'm a Traditional Chinese and Japanese legacy encoding user, I think I can say something here. Mon, 05 Aug 2013 00:51:52 +0800, Rich Felker <dalias@aerifal.cx> wrote: > OK, so here's what I've found so far. Both legacy Korean and legacy > Traditional Chinese encodings have essentially a single base character > set: > > > Traditional Chinese: > Big5 (CP950) > 89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE) > All characters in BMP > 27946 bytes table space > > Both of these have various minor extensions, but the main extensions > of any relevance seem to be: > > Traditional Chinese: > HKSCS (CP951) > Lead byte range is extended to 88-FE (119) > 1651 characters outside BMP > 37366 bytes table space for 16-bit mapping table, plus extra mapping > needed for characters outside BMP > There is another Big5 extension called Big5-UAO, which is used on the world's largest telnet-based BBS, "ptt.cc". It has two tables: one for Big5-UAO to Unicode, the other for Unicode to Big5-UAO. http://moztw.org/docs/big5/table/uao250-b2u.txt http://moztw.org/docs/big5/table/uao250-u2b.txt It extends the DBCS lead byte range down to 0x81. > The big remaining questions are: > > 1. How important are these extensions?
I would guess the answer is > "fairly important", especially for HKSCS where I believe the > additional characters are needed for encoding Cantonese words, but > it's less clear to me whether the Korean extensions are useful (they > seem to mainly be for the sake of completeness representing most/all > possible theoretical syllables that don't actually occur in words, but > this may be a naive misunderstanding on my part). For Big5-UAO, it contains Japanese and Simplified Chinese characters which do not exist in the original MS-CP950 implementation. > > 2. Are there patterns to exploit? For Korean, ALL of the Hangul > characters are actually combinations of several base letters. Unicode > encodes them all sequentially in a pattern where the conversion to > their constituent letters is purely algorithmic, but there seems to > be no clean pattern in the legacy encodings, as the encodings started > out just encoding the "important" ones then adding less important > combinations in separate ranges. In EUC-KR (MS-CP949), there are Hanja characters (i.e. Kanji characters in Japanese) and Japanese Katakana/Hiragana besides Hangul characters. > > Worst-case, adding Korean and Traditional Chinese tables will roughly > double the size of iconv.o to around 150k. This will noticeably enlarge > libc.so, but will make no difference to static-linked programs except > those using iconv. I'm hoping we can make these additions less > expensive, but I don't see a good way yet. For static linking, can we have conditional linking like Qt does? In Qt static linking, Q_IMPORT_PLUGIN is used to include the CJK codec tables:

#ifndef QT_SHARED
#include <QtPlugin>
Q_IMPORT_PLUGIN(qcncodecs)
Q_IMPORT_PLUGIN(qjpcodecs)
Q_IMPORT_PLUGIN(qkrcodecs)
Q_IMPORT_PLUGIN(qtwcodecs)
#endif

> > At some point, especially if the cost is not reduced, I will probably > add build-time options to exclude a configurable subset of the > supported character encodings.
This would not be extremely > fine-grained, and the choices to exclude would probably be just: > Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy > 8-bit might also be an option but these are so small I can't think of > cases where it would be beneficial to omit them (5k for the tables on > top of the 2k of actual code in iconv). Perhaps if there are cases > where iconv is needed purely for conversion between different Unicode > forms, but no legacy charsets, on tiny embedded devices, dropping the > 8-bit tables and all of the support code could be useful; the > resulting iconv would be around 1k, I think. > > Rich > HTH, Roy ^ permalink raw reply [flat|nested] 26+ messages in thread
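The "purely algorithmic" Unicode arrangement of Hangul that this exchange refers to is the standard conjoining-jamo arithmetic from the Unicode core specification: the precomposed syllables occupy U+AC00..U+D7A3 in lexicographic (lead, vowel, tail) order, so decomposition is pure integer math with no table at all. A minimal sketch:

```python
# Standard Unicode Hangul syllable decomposition (Unicode core spec,
# "Conjoining Jamo Behavior").  No lookup table is needed.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28  # 19 leads x 21 vowels x 28 tails = 11172

def decompose_hangul(ch):
    """Split one precomposed Hangul syllable into its jamo codepoints."""
    i = ord(ch) - S_BASE
    l = L_BASE + i // (V_COUNT * T_COUNT)
    v = V_BASE + (i % (V_COUNT * T_COUNT)) // T_COUNT
    t = T_BASE + i % T_COUNT
    return (l, v) if t == T_BASE else (l, v, t)
```

For example, U+D55C (the syllable "han") decomposes to lead U+1112, vowel U+1161, tail U+11AB. The legacy encodings discussed above follow no such arithmetic, which is exactly why they need tables.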
* Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-05 8:28 ` Roy @ 2013-08-05 15:43 ` Rich Felker 2013-08-05 17:31 ` Rich Felker 2013-08-05 19:12 ` Rich Felker 1 sibling, 1 reply; 26+ messages in thread From: Rich Felker @ 2013-08-05 15:43 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 04:28:32PM +0800, Roy wrote: > Since I'm a Traditional Chinese and Japanese legacy encoding user, I > think I can say something here. Great, thanks for joining in with some constructive input! :) > >Traditional Chinese: > >HKSCS (CP951) > >Lead byte range is extended to 88-FE (119) > >1651 characters outside BMP > >37366 bytes table space for 16-bit mapping table, plus extra mapping > >needed for characters outside BMP > > There is another Big5 extension called Big5-UAO, which is being used > in world's largest telnet-based BBS called "ptt.cc". > > It has two tables, one for Big5-UAO to Unicode, another one is > Unicode to Big5-UAO. > http://moztw.org/docs/big5/table/uao250-b2u.txt > http://moztw.org/docs/big5/table/uao250-u2b.txt > > Which extends DBCS lead byte to 0x81. Is it a superset of HKSCS or does it assign different characters to the range covered by HKSCS? > In EUC-KR (MS-CP949), there is Hanja characters (i.e. Kanji > characters in Japanese) and Japanese Katakana/Hiragana besides of > Hangul characters. Yes, I'm aware of these. However, it looks to me like the only characters outside the standard 94x94 grid zone are Hangul syllables, and they appear in codepoint order. If so, even if there's not a good pattern to where they're located, merely knowing that the ones that are missing from the 94x94 grid are placed in order in the expanded space is sufficient to perform algorithmic (albeit inefficient) conversion. Does this sound correct? > >Worst-case, adding Korean and Traditional Chinese tables will roughly > >double the size of iconv.o to around 150k. 
This will noticeably enlarge > >libc.so, but will make no difference to static-linked programs except > >those using iconv. I'm hoping we can make these additions less > >expensive, but I don't see a good way yet. > > For static linking, can we have conditional linking like QT does? My feeling is that it's a tradeoff, and probably has more pros than cons. Unlike QT, musl's iconv is extremely small. Even with all the above, the size of iconv.o will be under 130k, maybe closer to 110k. If you actually use iconv in your program, this is a small price to pay for having it fully functional. On the other hand, if linking it is conditional, you have to consider who makes the decision, and when. If it's at link time for each application, that's probably too much of a musl-specific mechanism. If it's at build time for musl, then is it your device vendor deciding for you what languages you need? One of the biggest headaches of uClibc-based systems is finding that the system libc was built with important options you need turned off and that you need to hack in a replacement to get something working... I think the cost of getting stuck with broken binaries where charsets were omitted is sufficiently greater than the cost of adding a few tens of kb to static binaries using iconv, that we should only consider a build time option if embedded users are actively reporting size problems. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-05 15:43 ` Rich Felker @ 2013-08-05 17:31 ` Rich Felker 0 siblings, 0 replies; 26+ messages in thread From: Rich Felker @ 2013-08-05 17:31 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 11:43:45AM -0400, Rich Felker wrote: > > In EUC-KR (MS-CP949), there is Hanja characters (i.e. Kanji > > characters in Japanese) and Japanese Katakana/Hiragana besides of > > Hangul characters. > > Yes, I'm aware of these. However, it looks to me like the only > characters outside the standard 94x94 grid zone are Hangul syllables, > and they appear in codepoint order. If so, even if there's not a good > pattern to where they're located, merely knowing that the ones that > are missing from the 94x94 grid are placed in order in the expanded > space is sufficient to perform algorithmic (albeit inefficient) > conversion. Does this sound correct? I've verified that this is correct and committed an implementation of Korean based on this principle, which I basically copied from my current implementation of GB18030's support for arbitrary Unicode codepoints. It has not been heavily tested but I did test it casually with all the important boundary values and it seems correct. Tests should probably be added to the test suite. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
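The principle Rich verified here (extension syllables appearing in Unicode codepoint order in the extended code space) can be sketched as follows. The tiny grid table and the function name are invented for illustration; the real KS X 1001 grid encodes 2350 syllables, and the real implementation also has to map the resulting index onto the irregular CP949 lead/tail byte ranges.

```python
import bisect

# Hypothetical sketch of ordered-gap conversion: syllables absent from
# the base 94x94 grid appear, in Unicode codepoint order, in the
# extended (CP949) code space, so the k-th missing syllable occupies
# the k-th extended slot and no per-character table is needed for the
# extension.  GRID below is toy data standing in for the real table.
GRID = [0xAC00, 0xAC01, 0xAC04]  # syllables the base grid encodes (sorted)
S_BASE = 0xAC00                  # first Hangul syllable codepoint

def extended_index(cp):
    """0-based slot of cp among syllables missing from the base grid."""
    in_grid_before = bisect.bisect_left(GRID, cp)  # grid syllables < cp
    return (cp - S_BASE) - in_grid_before
```

With this toy grid, U+AC02 is the first missing syllable (slot 0), U+AC03 the second, U+AC05 the third; the conversion is a search plus subtraction rather than a 27k table, which is the trade-off discussed above.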
* Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-05 8:28 ` Roy 2013-08-05 15:43 ` Rich Felker @ 2013-08-05 19:12 ` Rich Felker 2013-08-06 6:14 ` Roy 1 sibling, 1 reply; 26+ messages in thread From: Rich Felker @ 2013-08-05 19:12 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 04:28:32PM +0800, Roy wrote: > Since I'm a Traditional Chinese and Japanese legacy encoding user, I > think I can say something here. > [...] > There is another Big5 extension called Big5-UAO, which is being used > in world's largest telnet-based BBS called "ptt.cc". > > It has two tables, one for Big5-UAO to Unicode, another one is > Unicode to Big5-UAO. > http://moztw.org/docs/big5/table/uao250-b2u.txt > http://moztw.org/docs/big5/table/uao250-u2b.txt > > Which extends DBCS lead byte to 0x81. OK, I've been trying to do some research on this and I turned up: http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0061.html http://lists.gnu.org/archive/html/bug-gnu-libiconv/2010-11/msg00007.html My impression (please correct me if I'm wrong) is that you can't use Big5-UAO as the system encoding on modern versions of Windows (just ancient ones where you install unmaintained third-party software that hacks the system charset tables) and that it's not supported in GNU libiconv. If this is the case, and especially if Big5-UAO's main use is on a telnet-based BBS where everybody is using special telnet clients that have their own Big5-UAO converters, I'd find it really hard to justify trying to support this. But I'm open to hearing arguments on why we should, if you believe it's important. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-05 19:12 ` Rich Felker @ 2013-08-06 6:14 ` Roy 2013-08-06 13:32 ` Rich Felker 0 siblings, 1 reply; 26+ messages in thread From: Roy @ 2013-08-06 6:14 UTC (permalink / raw) To: musl Tue, 06 Aug 2013 03:12:47 +0800, Rich Felker <dalias@aerifal.cx> wrote: > On Mon, Aug 05, 2013 at 04:28:32PM +0800, Roy wrote: >> Since I'm a Traditional Chinese and Japanese legacy encoding user, I >> think I can say something here. >> [...] >> There is another Big5 extension called Big5-UAO, which is being used >> in world's largest telnet-based BBS called "ptt.cc". >> >> It has two tables, one for Big5-UAO to Unicode, another one is >> Unicode to Big5-UAO. >> http://moztw.org/docs/big5/table/uao250-b2u.txt >> http://moztw.org/docs/big5/table/uao250-u2b.txt >> >> Which extends DBCS lead byte to 0x81. > > OK, I've been trying to do some research on this and I turned up: > > http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0061.html > http://lists.gnu.org/archive/html/bug-gnu-libiconv/2010-11/msg00007.html > > My impression (please correct me if I'm wrong) is that you can't use > Big5-UAO as the system encoding on modern versions of Windows (just > ancient ones where you install unmaintained third-party software that > hacks the system charset tables) It doesn't "hack" the nls file but replaces it with a UAO-enabled CP950 nls file. The executable (setup program) is generated with NSIS (Nullsoft Scriptable Install System). Since the nls file format hasn't changed from NT 3.1 in 1993 through NT 6.2 (i.e. Win 8.1 "Blue"), the UAO-enabled CP950 nls will continue to work in newer versions of Windows unless MS replaces the nls file format with something different. > and that it's not supported in GNU > libiconv.
If this is the case, and especially if Big5-UAO's main use > is on a telnet-based BBS where everybody is using special telnet > clients that have their own Big5-UAO converters, GNU libiconv doesn't even support IBM EBCDIC (both SBCS and stateful SBCS+DBCS)! So does it matter whether GNU libiconv supports a given encoding? (Yes, glibc iconv (or rather, its gconv modules) does support both IBM EBCDIC SBCS and stateful SBCS+DBCS encodings.) > I'd find it really > hard to justify trying to support this. But I'm open to hearing > arguments on why we should, if you believe it's important. I think it would be nice to have a build/link time option for those "unpopular" encodings. >> For static linking, can we have conditional linking like QT does? > > My feeling is that it's a tradeoff, and probably has more pros than > cons. Unlike QT, musl's iconv is extremely small. I would add "right now" here. When we add more encodings later, the iconv module will be bigger than now, and people will need to find a way to conditionally compile in the encodings they need (whether linking dynamically or statically). > Even with all the > above, the size of iconv.o will be under 130k, maybe closer to 110k. > If you actually use iconv in your program, this is a small price to > pay for having it fully functional. On the other hand, if linking it > is conditional, you have to consider who makes the decision, and when. > If it's at link time for each application, that's probably too much of > a musl-specific version. Since statically linking libc-iconv is a new area now (other libcs don't touch this topic much), I think we can create a standard for statically linking a specified encoding table at link time. (This is also a reason why libc should provide a unique identifier via a preprocessor define.) > If it's at build time for musl, then is it > your device vendor deciding for you what languages you need?
One of > the biggest headaches of uClibc-based systems is finding that the > system libc was built with important options you need turned off and > that you need to hack in a replacement to get something working... > > I think the cost of getting stuck with broken binaries where charsets > were omitted is sufficiently greater than the cost of adding a few > tens of kb to static binaries using iconv, that we should only > consider a build time option if embedded users are actively reporting > size problems. > > Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-06 6:14 ` Roy @ 2013-08-06 13:32 ` Rich Felker 2013-08-06 15:11 ` Roy 0 siblings, 1 reply; 26+ messages in thread From: Rich Felker @ 2013-08-06 13:32 UTC (permalink / raw) To: musl On Tue, Aug 06, 2013 at 02:14:33PM +0800, Roy wrote: > >My impression (please correct me if I'm wrong) is that you can't use > >Big5-UAO as the system encoding on modern versions of Windows (just > >ancient ones where you install unmaintained third-party software that > >hacks the system charset tables) > > It doesn't "hack" the nls file but replaces it with a UAO-enabled > CP950 nls file. > The executable (setup program) is generated with NSIS (Nullsoft > Scriptable Install System). > Since the nls file format hasn't changed from NT 3.1 in 1993 through > NT 6.2 (i.e. Win 8.1 "Blue"), the UAO-enabled CP950 nls will > continue to work in newer versions of Windows unless MS replaces the > nls file format with something different. OK, thanks for clarifying. I'd still consider it a ways into the "hack" domain if the OS vendor still is not supporting it directly, but it does make a difference that it still works "cleanly". I was under the impression that these sorts of things changed between Windows versions in ways that would preclude using old, unmaintained patches like this. I agree that just the fact that certain OS vendors do not support an encoding is not in itself a reason not to support it. > >and that it's not supported in GNU > >libiconv. If this is the case, and especially if Big5-UAO's main use > >is on a telnet-based BBS where everybody is using special telnet > >clients that have their own Big5-UAO converters, > > GNU libiconv doesn't even support IBM EBCDIC (both SBCS and stateful > SBCS+DBCS)! > > So does it matter whether GNU libiconv supports a given encoding?
> (Yes glibc iconv(or say, gconv modules) does support both IBM EBCDIC > SBCS and stateful SBCS+DBCS encodings) I was under the impression that GNU libiconv was in sync with glibc's iconv, but I have not checked this. I actually was more interested in glibc's, which is in widespread use. glibc's inclusion or exclusion of a feature is not in itself a reason to include or exclude it, but supporting something that glibc supports does have the added motivation that it will increase compatibility with what programs are expecting. > >I'd find it really > >hard to justify trying to support this. But I'm open to hearing > >arguments on why we should, if you believe it's important. > > I think it will be nice to have build/link time option for those > "unpopular" encodings. > > >>For static linking, can we have conditional linking like QT does? > > > >My feeling is that it's a tradeoff, and probably has more pros than > >cons. Unlike QT, musl's iconv is extremely small. > > I would add "right now" here. When we adds more encoding later, > iconv module will be bigger than now, and people will need to find a > way to conditionally compiling the encoding they need (for both > dynamically or statically) It's never been my intent to add more encodings later (aside from pure non-table-based variants of existing ones, like the ISO-2022 versions) once coverage is complete, at least not as built-in features. This can be discussed if you think there are reasons it needs to change, but up until now, the plan has been to support: - ISO-8859 based 8-bit encodings - Other 8-bit encodings with actual legacy usage (mainly Cyrillic) - JIS 0208 based encodings - KS X 1001 based encodings - GB 2312 and supersets - Big5 and supersets All of those except Big5 and supersets are now supported, so short of any change, my position is that right now we're discussing the "last" significant addition to musl's iconv. 
Some things that are definitely outside the scope of musl's iconv: - Anything whose characters are not present in Unicode - Anything PUA-based (really, same as above) - Newly invented encodings with no historical encoded data What's more borderline is where UAO falls: encodings that have neither governmental nor language-body-authority support nor any vendor support from other software vendors, but for which there is at least one major corpus of historical data and/or current usage for the encoding by users of the language(s) whose characters are encoded. However, based on the file at http://moztw.org/docs/big5/table/uao250-b2u.txt a number of the mappings UAO defines are into the private use area. This would generally preclude support (as this is a font-specific encoding, not a Unicode encoding) unless the affected characters have since been added to Unicode and could be remapped to the correct codepoints. Do you know the status on this? I'm also still unclear on whether this is a superset of HKSCS (it's definitely not directly, but maybe it is if the PUA mappings are corrected; I did not do any detailed checks but just noted the lack of mappings to the non-BMP codepoints HKSCS uses). > >Even with all the > >above, the size of iconv.o will be under 130k, maybe closer to 110k. > >If you actually use iconv in your program, this is a small price to > >pay for having it fully functional. On the other hand, if linking it > >is conditional, you have to consider who makes the decision, and when. > >If it's at link time for each application, that's probably too much of > >a musl-specific version. > > Since statically linking libc-iconv is new area now (other libc > doesn't touch this topic much), I think we can create standard for > statically linking specified encoding table in link time.
> (This is also a reason of "why libc should provide an unique > identifier with preprocessor define") I don't see how "creating a standard" for doing this would make the situation any better. Most software authors these days are at best tolerant of the existence of static linking, and more often hostile to it. They're not going to add specific build behavior for static linking, and even if they do, they're likely to get it wrong, in which case the user ends up stuck with binaries that can't process input in their language. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-06 13:32 ` Rich Felker @ 2013-08-06 15:11 ` Roy 2013-08-06 16:22 ` Rich Felker 0 siblings, 1 reply; 26+ messages in thread From: Roy @ 2013-08-06 15:11 UTC (permalink / raw) To: musl On Tue, 06 Aug 2013 21:32:05 +0800, Rich Felker <dalias@aerifal.cx> wrote: > On Tue, Aug 06, 2013 at 02:14:33PM +0800, Roy wrote: >> >My impression (please correct me if I'm wrong) is that you can't use >> >Big5-UAO as the system encoding on modern versions of Windows (just >> >ancient ones where you install unmaintained third-party software that >> >hacks the system charset tables) >> >> It doesn't "hack" the nls file but replaces with UAO-available CP950 >> nls file. >> The executable(setup program) is generated with NSIS(Nullsoft >> Scriptable Install System). >> Since the nls file format doesn't change since NT 3.1 in 1993 till >> now NT 6.2(i.e. Win 8.1 "Blue"), the UAO-available CP950 nls will >> continue to work in newer versions of windows unless MS throw away >> nls file format with something different. > > OK, thanks for clarifying. I'd still consider it a ways into the > "hack" domain if the OS vendor still is not supporting it directly, > but it does make a difference that it still works "cleanly". I was > under the impression that these sorts of things changes between > Windows versions in ways that would preclude using old, unmaintained > patches like this. I agree that just the fact that certain OS vendors > do not support an encoding is not in itself a reason not to support > it. > >> >and that it's not supported in GNU >> >libiconv. If this is the case, and especially if Big5-UAO's main use >> >is on a telnet-based BBS where everybody is using special telnet >> >clients that have their own Big5-UAO converters, >> >> GNU libiconv even not supports IBM EBCDIC(both SBCS and stateful >> SBCS+DBCS)! >> >> So does it matter if GNU libiconv is not support whatever encodings? 
>> (Yes glibc iconv(or say, gconv modules) does support both IBM EBCDIC >> SBCS and stateful SBCS+DBCS encodings) > > I was under the impression that GNU libiconv was in sync with glibc's > iconv, but I have not checked this. I actually was more interested in > glibc's, which is in widespread use. glibc's inclusion or exclusion of > a feature is not in itself a reason to include or exclude it, but > supporting something that glibc supports does have the added > motivation that it will increase compatibility with what programs are > expecting. > >> >I'd find it really >> >hard to justify trying to support this. But I'm open to hearing >> >arguments on why we should, if you believe it's important. >> >> I think it will be nice to have build/link time option for those >> "unpopular" encodings. >> >> >>For static linking, can we have conditional linking like QT does? >> > >> >My feeling is that it's a tradeoff, and probably has more pros than >> >cons. Unlike QT, musl's iconv is extremely small. >> >> I would add "right now" here. When we adds more encoding later, >> iconv module will be bigger than now, and people will need to find a >> way to conditionally compiling the encoding they need (for both >> dynamically or statically) > > It's never been my intent to add more encodings later (aside from pure > non-table-based variants of existing ones, like the ISO-2022 versions) > once coverage is complete, at least not as built-in features. This can > be discussed if you think there are reasons it needs to change, but up > until now, the plan has been to support: > > - ISO-8859 based 8-bit encodings > - Other 8-bit encodings with actual legacy usage (mainly Cyrillic) > - JIS 0208 based encodings > - KS X 1001 based encodings > - GB 2312 and supersets > - Big5 and supersets > > All of those except Big5 and supersets are now supported, so short of > any change, my position is that right now we're discussing the "last" > significant addition to musl's iconv. 
> > Some things that are definitely outside the scope of musl's iconv: > > - Anything whose characters are not present in Unicode > - Anything PUA-based (really, same as above) > - Newly invented encodings with no historical encoded data > > What's more borderline is where UAO falls: encodings that have neither > governmental nor language-body-authority support nor any vendor support > from other software vendors, but for which there is at least one major > corpus of historical data and/or current usage for the encoding by > users of the language(s) whose characters are encoded. > > However, based on the file at > > http://moztw.org/docs/big5/table/uao250-b2u.txt > > a number of the mappings UAO defines are into the private use area. > This would generally preclude support (as this is a font-specific > encoding, not a Unicode encoding) unless the affected characters have > since been added to Unicode and could be remapped to the correct > codepoints. Do you know the status on this? Those are in the Big5-2003 compatibility code range. Big5-2003 is in a CNS 11643 appendix, but it is rarely used since no OS/application supports it. So skipping the PUA mappings is fine. > > I'm also still unclear on whether this is a superset of HKSCS (it's > definitely not directly, but maybe it is if the PUA mappings are > corrected; I did not do any detailed checks but just noted the lack of > mappings to the non-BMP codepoints HKSCS uses). No, it isn't. There are some code conflicts between HKSCS (2001/2004) and UAO. > >> >Even with all the >> >above, the size of iconv.o will be under 130k, maybe closer to 110k. >> >If you actually use iconv in your program, this is a small price to >> >pay for having it fully functional. On the other hand, if linking it >> >is conditional, you have to consider who makes the decision, and when. >> >If it's at link time for each application, that's probably too much of >> >a musl-specific version. 
>> Since statically linking libc iconv is a new area (other libcs >> don't touch this topic much), I think we can create a standard for >> statically linking specified encoding tables at link time. >> (This is also a reason why libc should provide a unique >> identifier via a preprocessor define.) > > I don't see how "creating a standard" for doing this would make the > situation any better. Most software authors these days are at best > tolerant of the existence of static linking, and more often hostile to > it. They're not going to add specific build behavior for static > linking, and even if they do, they're likely to get it wrong, in which > case the user ends up stuck with binaries that can't process input in > their language. > > Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-06 15:11 ` Roy @ 2013-08-06 16:22 ` Rich Felker 2013-08-07 0:54 ` Roy 0 siblings, 1 reply; 26+ messages in thread From: Rich Felker @ 2013-08-06 16:22 UTC (permalink / raw) To: musl On Tue, Aug 06, 2013 at 11:11:23PM +0800, Roy wrote: > >However, based on the file at > > > >http://moztw.org/docs/big5/table/uao250-b2u.txt > > > >a number of the mappings UAO defines are into the private use area. > >This would generally preclude support (as this is a font-specific > >encoding, not a Unicode encoding) unless the affected characters have > >since been added to Unicode and could be remapped to the correct > >codepoints. Do you know the status on this? > > Those are in the Big5-2003 compatibility code range. Big5-2003 is in > a CNS 11643 appendix, but it is rarely used since no > OS/application supports it. > So skipping the PUA mappings is fine. OK, a few more questions... 1. What, if anything, is the accepted charset name for Big5-UAO, i.e. how would it appear in MIME headers, etc.? 2. Can you give me an idea of the relationship between the Big5 variants/extensions/supersets? I'm aware of Windows CP950, HKSCS, and now UAO. Is CP950 a common subset of them all, or is there a smaller base subset "plain Big5" that's the only shared part? What is ETEN and how does it fit in? 3. How should different MIME charset names be handled? In particular, what does plain "Big5" refer to? Should it be interpreted as CP950? 4. Is there anywhere to get clean semi-authoritative sources for the definitions of these charsets in plain text form? For HKSCS I found a government PDF file but it's useless because you can't extract the data in any meaningful way. Unicode has the CP950 file and "BIG5" file, but the latter refers to Unicode 1.1 in the comments and I've heard claims that it's completely wrong on many issues. 
Unihan.txt is also fairly useless because it only defines the mappings for ideographic characters, not the rest of the mappings in legacy CJK encodings. Short of anything better I may just have to use glibc output as a reference... > >I'm also still unclear on whether this is a superset of HKSCS (it's > >definitely not directly, but maybe it is if the PUA mappings are > >corrected; I did not do any detailed checks but just noted the lack of > >mappings to the non-BMP codepoints HKSCS uses). > > No, it isn't. There are some code conflicts between HKSCS (2001/2004) and UAO. Some conflict or heavy conflict? From an implementation standpoint, I want to know if this is something where they could use a common table plus "if (type==BIG5UAO) { /* fixups here */ ... }" or if they need completely separate tables. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-06 16:22 ` Rich Felker @ 2013-08-07 0:54 ` Roy 2013-08-07 7:20 ` Roy 0 siblings, 1 reply; 26+ messages in thread From: Roy @ 2013-08-07 0:54 UTC (permalink / raw) To: musl On Wed, 07 Aug 2013 00:22:15 +0800, Rich Felker <dalias@aerifal.cx> wrote: > On Tue, Aug 06, 2013 at 11:11:23PM +0800, Roy wrote: >> >However, based on the file at >> > >> >http://moztw.org/docs/big5/table/uao250-b2u.txt >> > >> >a number of the mappings UAO defines are into the private use area. >> >This would generally preclude support (as this is a font-specific >> >encoding, not a Unicode encoding) unless the affected characters have >> >since been added to Unicode and could be remapped to the correct >> >codepoints. Do you know the status on this? >> >> Those are in the Big5-2003 compatibility code range. Big5-2003 is in >> a CNS 11643 appendix, but it is rarely used since no >> OS/application supports it. >> So skipping the PUA mappings is fine. > > OK, a few more questions... > > 1. What, if anything, is the accepted charset name for Big5-UAO, i.e. > how would it appear in MIME headers, etc.? There isn't one; actually, all Big5 variants use "big5". > > 2. Can you give me an idea of the relationship between the Big5 > variants/extensions/supersets? I'm aware of Windows CP950, HKSCS, and > now UAO. Is CP950 a common subset of them all, or is there a smaller > base subset "plain Big5" that's the only shared part? What is ETEN and > how does it fit in? MS CP950 can be considered a common subset of HKSCS/UAO/ETEN, etc. Big5-ETEN mostly looks like CP950 but adds the Japanese katakana/hiragana area, etc. > > 3. How should different MIME charset names be handled? In particular, > what does plain "Big5" refer to? Should it be interpreted as CP950? Since they all use the same MIME name, it depends on the system codepage. Some Hong Kong news websites still use Big5-HKSCS. 
For people using Internet Explorer with HKSCS installed, the big5 MIME name will map to Big5-HKSCS (that is, the single CP950 entry is mapped to CP951.nls, which is HKSCS). Firefox users have to choose Big5-HKSCS by hand, or via an extension that checks the domain name. > > 4. Is there anywhere to get clean semi-authoritative sources for the > definitions of these charsets in plain text form? For HKSCS I found a > government PDF file but it's useless because you can't extract the > data in any meaningful way. Unicode has the CP950 file and "BIG5" > file, but the latter refers to Unicode 1.1 in the comments and I've > heard claims that it's completely wrong on many issues. Unihan.txt is > also fairly useless because it only defines the mappings for > ideographic characters, not the rest of the mappings in legacy CJK > encodings. Short of anything better I may just have to use glibc > output as a reference... There is documentation created by the Mozilla Taiwan community: http://moztw.org/docs/big5/ Google Translate: http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=zh-TW&ie=UTF-8&u=http%3A%2F%2Fmoztw.org%2Fdocs%2Fbig5%2F > >> >I'm also still unclear on whether this is a superset of HKSCS (it's >> >definitely not directly, but maybe it is if the PUA mappings are >> >corrected; I did not do any detailed checks but just noted the lack of >> >mappings to the non-BMP codepoints HKSCS uses). >> >> No, it isn't. There are some code conflicts between HKSCS (2001/2004) and >> UAO. > > Some conflict or heavy conflict? From an implementation standpoint, I > want to know if this is something where they could use a common table > plus "if (type==BIG5UAO) { /* fixups here */ ... }" or if they need > completely separate tables. 
Big5-HKSCS 2004 map for reference: http://moztw.org/docs/big5/table/hkscs2004.txt Use sed and awk to create a b2u.txt for comparison: $ sed -e '/^==/d' -e '1,2d' hkscs2004.txt | awk 'BEGIN{print "# big5 unicode"}{print "0x" $1 " 0x" $4}' > hkscs2004-b2u.txt The result: http://roy.dnsd.me/hkscs2004-b2u.txt And finally the diff: http://roy.dnsd.me/uao250-hkscs2004.diff The diff is huge, so a separate table is needed. > > Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
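[Editor's note: the extraction pipeline in the message above can be tried end to end on fabricated input. The sample below assumes, as the sed/awk command implies, that hkscs2004.txt has two header lines, `==` rule lines, and data rows with the Big5 code in column 1 and the Unicode code point in column 4; the file contents here are toy data, not the real table.]

```shell
# Toy reproduction of the b2u extraction step. The input layout below
# is an assumption inferred from the sed/awk command in the message:
# two header lines, '==' rules, then whitespace-separated rows with
# Big5 in field 1 and Unicode in field 4.
cat > hkscs2004-sample.txt <<'EOF'
Big5-HKSCS:2004 mapping table
big5  iso  cns  unicode
==============================
8840  .    .    00A8
8841  .    .    00B8
EOF

# Drop the headers and rules, keep fields 1 and 4, prefix with 0x.
sed -e '/^==/d' -e '1,2d' hkscs2004-sample.txt |
awk 'BEGIN{print "# big5 unicode"}{print "0x" $1 " 0x" $4}' \
    > sample-b2u.txt

cat sample-b2u.txt
# prints:
# # big5 unicode
# 0x8840 0x00A8
# 0x8841 0x00B8
```

Any POSIX shell with sed and awk should produce the same three-line table.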
* Re: Re: Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-07 0:54 ` Roy @ 2013-08-07 7:20 ` Roy 0 siblings, 0 replies; 26+ messages in thread From: Roy @ 2013-08-07 7:20 UTC (permalink / raw) To: musl On Wed, 07 Aug 2013 08:54:35 +0800, Roy <roytam@gmail.com> wrote: [snip] > > Big5-HKSCS 2004 map for reference: > http://moztw.org/docs/big5/table/hkscs2004.txt > Use sed and awk to create a b2u.txt for comparison: > $ sed -e '/^==/d' -e '1,2d' hkscs2004.txt | awk 'BEGIN{print "# big5 > unicode"}{print "0x" $1 " 0x" $4}' > hkscs2004-b2u.txt > The result: > http://roy.dnsd.me/hkscs2004-b2u.txt > > And finally the diff: > http://roy.dnsd.me/uao250-hkscs2004.diff > > The diff is huge, so a separate table is needed. I forgot that the HKSCS table is missing the original CP950 entries. $ cat cp950-b2u.txt hkscs2004-b2u.txt | sed -e '1d' | sort > hkscs2004-big5-b2u.txt I also wrote a small utility in PHP to compare two tables by key (the first column): http://roy.dnsd.me/tbldiff.phps $ php tbldiff.php uao250-b2u.txt hkscs2004-big5-b2u.txt > uao250-vs-hkscs2004.txt http://roy.dnsd.me/uao250-vs-hkscs2004.txt $ sed -e '/==/d' uao250-vs-hkscs2004.txt > uao250-hkscs2004-diff.txt http://roy.dnsd.me/uao250-hkscs2004-diff.txt So 5965 mappings are different, including 1379 that do not exist in HKSCS 2004. But since there is mixed usage of HKSCS 2001/2004 in both local files and Internet pages, the HKSCS situation is even worse. BTW, there is another NLS hack that patches MS CP932 to support JIS X 0213:2004: http://www.eonet.ne.jp/~kotobukispace/ddt/jisx0213/jisx0213.html ^ permalink raw reply [flat|nested] 26+ messages in thread
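[Editor's note: for readers without the linked PHP source, a hypothetical awk equivalent of the key-based comparison that tbldiff.php performs might look like the sketch below. The file names, toy mappings, and output format are illustrative, not what the real utility prints.]

```shell
# Hypothetical key-based table diff in the spirit of tbldiff.php:
# join two "0xBIG5 0xUCS" tables on the first column, report entries
# whose mappings differ, and keys present in only one file.
cat > a-b2u.txt <<'EOF'
0xA140 0x3000
0xA141 0xFE50
0xA142 0x3001
EOF
cat > b-b2u.txt <<'EOF'
0xA140 0x3000
0xA141 0xFF0C
0xA143 0x3002
EOF

# First pass (NR==FNR) loads table A; second pass walks table B.
awk 'NR==FNR { a[$1]=$2; next }
     $1 in a { if (a[$1] != $2) print $1, a[$1], "->", $2; delete a[$1]; next }
     { print $1, "only in second" }
     END { for (k in a) print k, "only in first" }' \
    a-b2u.txt b-b2u.txt > tbl-diff.txt

cat tbl-diff.txt
# prints (END-loop ordering may vary):
# 0xA141 0xFE50 -> 0xFF0C
# 0xA143 only in second
# 0xA142 only in first
```

The `NR==FNR` idiom is what makes the first file act as the lookup table; identical mappings are deleted as they match, so only conflicts and one-sided keys remain.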
end of thread, other threads:[~2013-08-07 7:20 UTC | newest] Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-08-04 16:51 iconv Korean and Traditional Chinese research so far Rich Felker 2013-08-04 22:39 ` Harald Becker 2013-08-05 0:44 ` Szabolcs Nagy 2013-08-05 1:24 ` Harald Becker 2013-08-05 3:13 ` Szabolcs Nagy 2013-08-05 7:03 ` Harald Becker 2013-08-05 12:54 ` Rich Felker 2013-08-05 0:49 ` Rich Felker 2013-08-05 1:53 ` Harald Becker 2013-08-05 3:39 ` Rich Felker 2013-08-05 7:53 ` Harald Becker 2013-08-05 8:24 ` Justin Cormack 2013-08-05 14:43 ` Rich Felker 2013-08-05 14:35 ` Rich Felker 2013-08-05 0:46 ` Harald Becker 2013-08-05 5:00 ` Rich Felker 2013-08-05 8:28 ` Roy 2013-08-05 15:43 ` Rich Felker 2013-08-05 17:31 ` Rich Felker 2013-08-05 19:12 ` Rich Felker 2013-08-06 6:14 ` Roy 2013-08-06 13:32 ` Rich Felker 2013-08-06 15:11 ` Roy 2013-08-06 16:22 ` Rich Felker 2013-08-07 0:54 ` Roy 2013-08-07 7:20 ` Roy
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/