* iconv Korean and Traditional Chinese research so far
@ 2013-08-04 16:51 Rich Felker
  (3 more replies) 0 siblings, 4 replies; 26+ messages in thread
From: Rich Felker @ 2013-08-04 16:51 UTC (permalink / raw)
To: musl

OK, so here's what I've found so far. Both legacy Korean and legacy
Traditional Chinese encodings have essentially a single base character
set:

Korean: KS X 1001 (previously known as KS C 5601)
  93 x 94 DBCS grid (A1-FD A1-FE)
  All characters in BMP
  17484 bytes table space

Traditional Chinese: Big5 (CP950)
  89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
  All characters in BMP
  27946 bytes table space

Both of these have various minor extensions, but the main extensions
of any relevance seem to be:

Korean: CP949
  Lead byte range is extended to 81-FD (125)
  Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
  44500 bytes table space

Traditional Chinese: HKSCS (CP951)
  Lead byte range is extended to 88-FE (119)
  1651 characters outside BMP
  37366 bytes table space for 16-bit mapping table, plus extra mapping
  needed for characters outside BMP

The big remaining questions are:

1. How important are these extensions? I would guess the answer is
"fairly important", especially for HKSCS, where I believe the
additional characters are needed for encoding Cantonese words, but
it's less clear to me whether the Korean extensions are useful (they
seem to mainly be for the sake of completeness, representing most/all
theoretically possible syllables that don't actually occur in words,
but this may be a naive misunderstanding on my part).

2. Are there patterns to exploit? For Korean, ALL of the Hangul
characters are actually combinations of several base letters.
Unicode encodes them all sequentially in a pattern where the
conversion to their constituent letters is purely algorithmic, but
there seems to be no clean pattern in the legacy encodings, as the
encodings started out just encoding the "important" ones, then added
less important combinations in separate ranges.

Worst-case, adding Korean and Traditional Chinese tables will roughly
double the size of iconv.o to around 150k. This will noticeably
enlarge libc.so, but will make no difference to static-linked programs
except those using iconv. I'm hoping we can make these additions less
expensive, but I don't see a good way yet.

At some point, especially if the cost is not reduced, I will probably
add build-time options to exclude a configurable subset of the
supported character encodings. This would not be extremely
fine-grained, and the choices to exclude would probably be just:
Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
8-bit might also be an option, but these are so small I can't think of
cases where it would be beneficial to omit them (5k for the tables on
top of the 2k of actual code in iconv). Perhaps if there are cases
where iconv is needed purely for conversion between different Unicode
forms, but no legacy charsets, on tiny embedded devices, dropping the
8-bit tables and all of the support code could be useful; the
resulting iconv would be around 1k, I think.

Rich
* Re: iconv Korean and Traditional Chinese research so far
From: Harald Becker @ 2013-08-04 22:39 UTC (permalink / raw)
Cc: musl, dalias

Hi Rich !

> Worst-case, adding Korean and Traditional Chinese tables will
> roughly double the size of iconv.o to around 150k. This will
> noticeably enlarge libc.so, but will make no difference to
> static-linked programs except those using iconv. I'm hoping we
> can make these additions less expensive, but I don't see a good
> way yet.

Oh nooo, do you really want to add this statically to the iconv
version?

Why can't we have all these character conversions in a state-driven
machine which loads its information from an external configuration
file? This way we can have any kind of conversion someone likes, by
just adding the configuration file for the required Unicode to X and
X to Unicode conversions.

State-driven FSM interpreters are really small and fast, and may read
their complete configuration from a file ... an architecture
independent file, so we may have the same character conversion files
for all architectures.

> At some point, especially if the cost is not reduced, I will
> probably add build-time options to exclude a configurable
> subset of the supported character encodings.

All this would go away if you do not load character conversions from
a static table. Why don't you consider loading a conversion file for
a given character set from a predefined or configurable directory,
with the name of the character set as the filename?
If you want the file to be in a directly readable/modifiable form,
you need to add a minimalistic parser; otherwise the file contents
may be considered binary data, and you can just fread or mmap the
file and use the data to control character set conversion. Most
conversions only need minimal space; only some require bigger
conversion routines.

... and those who dislike this just don't install the conversion
files they do not want.

--
Harald
* Re: iconv Korean and Traditional Chinese research so far
From: Szabolcs Nagy @ 2013-08-05 0:44 UTC (permalink / raw)
To: musl

* Harald Becker <ralda@gmx.de> [2013-08-05 00:39:43 +0200]:
> Why can't we have all these character conversions in a state-driven
> machine which loads its information from an external configuration
> file? This way we can have any kind of conversion someone likes, by
> just adding the configuration file for the required Unicode to X
> and X to Unicode conversions.

external files provided by libc can work but they
should be possible to embed into the binary

otherwise a static binary is not self-contained
and you have to move parts of the libc around
along with the binary and if they are loaded
from fixed path then it does not work at all
(permissions, conflicting versions etc)

if the format changes then dynamic linking is
problematic as well: you cannot update libc
in a single atomic operation
* Re: iconv Korean and Traditional Chinese research so far
From: Harald Becker @ 2013-08-05 1:24 UTC (permalink / raw)
Cc: musl, nsz

Hi !

05-08-2013 02:44 Szabolcs Nagy <nsz@port70.net>:
> * Harald Becker <ralda@gmx.de> [2013-08-05 00:39:43 +0200]:
> > Why can't we have all these character conversions in a
> > state-driven machine which loads its information from an
> > external configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
>
> external files provided by libc can work but they
> should be possible to embed into the binary

As far as I know, glibc creates small dynamically linked objects and
loads those when required. That is architecture specific, so you
always need conversion files which correspond to your C library.

My intention is to write the conversion as machine independent byte
code, which may be copied between machines of different
architectures. If you need a charset conversion, just add the charset
byte code to the conversion directory, which may be configurable
(directory name from an environment variable with a default
fallback). There may even be a search path for conversion files, so
conversion files may be installed in different locations.

> otherwise a static binary is not self-contained
> and you have to move parts of the libc around
> along with the binary and if they are loaded
> from fixed path then it does not work at all
> (permissions, conflicting versions etc)

Ok, I see the static linking topic, but this is no problem with byte
code conversion programs. It can easily be handled: just concatenate
all the conversion byte code programs into a single big array, with a
name and offset table ahead, then link it into your program.
This may be done in two steps:

1) Create a selection file for the musl build, and include the
   specified charsets in libc.a/.so

2) Select the required charset files and create an .o file to link
   into your program.

iconv then shall:

- look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
- search the table of charsets linked with libc
- search the table of charsets linked with the program
- search for the charset on an external search path

... or do it in the opposite direction and use the first charset
conversion found. This lookup is usually very small, except for the
file system search, so it shall not produce much overhead / bloat.

[Addendum after thinking a bit more: The byte code conversion files
shall consist of a small static header, followed by the byte code
program. The header shall contain the charset name, the version of
the required virtual machine, and the length of the byte code. So you
need only concatenate all such conversion files into a big array of
bytes and add a null header to mark the end of the table. Then you
only need the start of the array and you are able to search through
it for a specific charset. The iconv function in libc contains a
definition of an "unsigned char const *iconv_user_charsets = NULL;",
which is linked in when the user does not provide their own
definition. So iconv can search all linked-in charset definitions and
needs no code changes. Really simple configuration to select charsets
to build in.]

> if the format changes then dynamic linking is
> problematic as well: you cannot update libc
> in a single atomic operation

The byte code shall be independent of dynamic linking. The conversion
files are only streams of bytes, which shall also be architecture
independent. So you only need to update the conversion files if the
virtual machine definition of iconv has been changed (which shall not
happen much). External files may be read into malloc-ed buffers or
mmap-ed, not linked in by the dynamic linker.

--
Harald
* Re: iconv Korean and Traditional Chinese research so far
From: Szabolcs Nagy @ 2013-08-05 3:13 UTC (permalink / raw)
To: musl

* Harald Becker <ralda@gmx.de> [2013-08-05 03:24:52 +0200]:
> iconv then shall:
> - look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
> - search the table of charsets linked with libc
> - search the table of charsets linked with the program
> - search for the charset on an external search path

sounds like a lot of extra management cost
(for libc, application writer and user as well)

it would be nice if the compiler could figure out
at build time (eg with lto) which tables are used
but i guess charsets are often only known at runtime

> [Addendum after thinking a bit more: The byte code conversion
> files shall consist of a small static header, followed by the
> byte code program. The header shall contain the charset name,
> the version of the required virtual machine, and the length of
> the byte code. So you need only concatenate all such conversion
> files into a big array of bytes and add a null header to mark
> the end of the table. Then you only need the start of the array
> and you are able to search through it for a specific charset.
> The iconv function in libc contains a definition of an "unsigned
> char const *iconv_user_charsets = NULL;", which is linked in
> when the user does not provide their own definition. So iconv
> can search all linked-in charset definitions and needs no code
> changes. Really simple configuration to select charsets to
> build in.]

yes that can work, but it's a musl specific hack
that the application programmer needs to take care of

> > if the format changes then dynamic linking is
> > problematic as well: you cannot update libc
> > in a single atomic operation
>
> The byte code shall be independent of dynamic linking.
> The conversion files are only streams of bytes, which shall also
> be architecture independent. So you only need to update the
> conversion files if the virtual machine definition of iconv has
> been changed (which shall not happen much). External files may be
> read into malloc-ed buffers or mmap-ed, not linked in by the
> dynamic linker.

that does not solve the format change problem
you cannot update libc without race
(unless you first replace the .so which supports
the old format as well as the new one, but then
libc has to support all previous formats)

it's probably easy to design a fixed format to
avoid this

it seems somewhat similar to the timezone problem
except zoneinfo is maintained outside of libc so
there is not much choice, but there are the same
issues: updating it should be done carefully,
setuid programs must be handled specially etc
* Re: iconv Korean and Traditional Chinese research so far
From: Harald Becker @ 2013-08-05 7:03 UTC (permalink / raw)
Cc: musl, nsz

Hi !

05-08-2013 05:13 Szabolcs Nagy <nsz@port70.net>:
> * Harald Becker <ralda@gmx.de> [2013-08-05 03:24:52 +0200]:
> > iconv then shall:
> > - look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
> > - search the table of charsets linked with libc
> > - search the table of charsets linked with the program
> > - search for the charset on an external search path
>
> sounds like a lot of extra management cost
> (for libc, application writer and user as well)

This is not so much work. You already need to search for the
character set table to use; that is, you need to search at least a
table of string values to find the pointer to the conversion table.
"Searching a table" in my statement above means just walking a
pointer chain doing string compares to find a matching character
set. Not much different from the code that is required anyway. Doing
this twice to check a possible user chain is just one more helper
function call.

The only code that gets a bit bigger is the file system search. This
depends on whether we only try a single location or walk through a
search path list. But this is the cost of the flexibility to
dynamically load character set conversions (which I would really
prefer for seldom used char sets).

... and for the application writer it is only more work if he likes
to add some charset tables to his program which are not in the
static libc.

The problem is, all tables in libc need to be linked into your
program if you include iconv. So each added charset conversion
increases the size of your program ... and I definitely won't
include Japanese, Chinese or Korean charsets in my programs. Not
that I ignore those peoples' needs; I just won't need them, so I
don't like to add those conversions to programs sitting on my disk.
> it would be nice if the compiler could figure out
> at build time (eg with lto) which tables are used
> but i guess charsets are often only known at runtime

How do you want to do this? And how shall the compiler know which
char sets the user may use during operation? So the only way to
select the charset tables to include in your program is by assuming
ahead of time which tables might be used. That is part of the
configuration of the musl build or the application program build.

> > [Addendum after thinking a bit more: The byte code conversion
> > files shall consist of a small static header, followed by the
> > byte code program. The header shall contain the charset name,
> > the version of the required virtual machine, and the length of
> > the byte code. So you need only concatenate all such conversion
> > files into a big array of bytes and add a null header to mark
> > the end of the table. Then you only need the start of the array
> > and you are able to search through it for a specific charset.
> > The iconv function in libc contains a definition of an
> > "unsigned char const *iconv_user_charsets = NULL;", which is
> > linked in when the user does not provide their own definition.
> > So iconv can search all linked-in charset definitions and needs
> > no code changes. Really simple configuration to select charsets
> > to build in.]
>
> yes that can work, but it's a musl specific hack
> that the application programmer needs to take care of

Extra work has to be done only if the application programmer wants
to add a char set to the statically built program which is not in
libc. That gives some more flexibility. If you don't care, you get
the musl built-in list of char sets.

> > > if the format changes then dynamic linking is
> > > problematic as well: you cannot update libc
> > > in a single atomic operation
> >
> > The byte code shall be independent of dynamic linking. The
> > conversion files are only streams of bytes, which shall also
> > be architecture independent.
> > So you only need to update the conversion files if the virtual
> > machine definition of iconv has been changed (which shall not
> > happen much). External files may be read into malloc-ed buffers
> > or mmap-ed, not linked in by the dynamic linker.
>
> that does not solve the format change problem
> you cannot update libc without race
> (unless you first replace the .so which supports
> the old format as well as the new one, but then
> libc has to support all previous formats)

If the definition of the iconv virtual state machine is modified,
you need to take extra care on update (delete old charset files,
install new lib, install new charset files, restart system) ... but
this is only required on a major update. Once the virtual machine
definition has stabilized, you do not need to change charset
definition files; you just update your lib, then update any new
charset files. After an initial phase of testing it shall happen
relatively seldom that the virtual machine definition needs to be
changed in an incompatible manner, and simply extending the virtual
machine does not invalidate the old charset files.

> it's probably easy to design a fixed format to
> avoid this

A fixed format? For what? Do you know the differences between char
sets, especially multi byte char sets?

> it seems somewhat similar to the timezone problem
> except zoneinfo is maintained outside of libc so
> there is not much choice, but there are the same
> issues: updating it should be done carefully,
> setuid programs must be handled specially etc

Again: once the virtual machine definition has reached a stable
state, it shall not happen much that any change invalidates a
charset definition file. That is, old files will at least continue
to work with newer lib versions. So there is no problem on update;
just update your lib, then update your charset files. The only
problem will be if a still running application uses a new charset
file with an old version of the lib.
This will be detected and lead to a failure code from iconv. So you
need to restart your application ... which is always a good decision
after you have updated your lib.

--
Harald
* Re: iconv Korean and Traditional Chinese research so far
From: Rich Felker @ 2013-08-05 12:54 UTC (permalink / raw)
To: musl; +Cc: nsz

On Mon, Aug 05, 2013 at 09:03:43AM +0200, Harald Becker wrote:
> The only code that gets a bit bigger is the file system search.
> This depends on whether we only try a single location or walk
> through a search path list. But this is the cost of the
> flexibility to dynamically load character set conversions (which
> I would really prefer for seldom used char sets).

The only "seldom used char sets" are either extremely small (8bit
codepages) or simply encoding variants of an existing CJK DBCS (in
which case it's just a matter of code, not large data tables, to
support them).

> ... and for the application writer it is only more work if he
> likes to add some charset tables to his program which are not in
> the static libc.

This is only helpful if the application writer is designing around
musl. This is a practice we explicitly discourage.

> The problem is, all tables in libc need to be linked into your
> program if you include iconv. So each added charset conversion
> increases the size of your program ... and I definitely won't
> include Japanese, Chinese or Korean charsets in my programs. Not
> that I ignore those peoples' needs; I just won't need them, so I
> don't like to add those conversions to programs sitting on my
> disk.

How many programs do you intend to use iconv in that _don't_ need to
support arbitrary encodings, including ones you might not be using
yourself? Even if you don't read Korean, if a Korean user sends you
an email containing non-ASCII punctuation, Greek letters like
epsilon, etc., there's a fair chance their MUA will choose to encode
it with a legacy Korean encoding rather than UTF-8, and then you need
the conversion.
It would be nice if everybody encoded everything in UTF-8 so the
recipient was not responsible for supporting a wide range of legacy
encodings, but that's not the reality today.

> If the definition of the iconv virtual state machine is modified,
> you need to take extra care on update (delete old charset files,
> install new lib, install new charset files, restart system) ...
> but this is only required on a major update.

Even if there were really good reasons for the design you're
proposing, such a violation of the stability and atomic upgrade
policy would require a strong overriding justification. We don't
have that here.

Rich
* Re: iconv Korean and Traditional Chinese research so far
From: Rich Felker @ 2013-08-05 0:49 UTC (permalink / raw)
To: musl

On Mon, Aug 05, 2013 at 12:39:43AM +0200, Harald Becker wrote:
> Hi Rich !
>
> > Worst-case, adding Korean and Traditional Chinese tables will
> > roughly double the size of iconv.o to around 150k. This will
> > noticeably enlarge libc.so, but will make no difference to
> > static-linked programs except those using iconv. I'm hoping we
> > can make these additions less expensive, but I don't see a good
> > way yet.
>
> Oh nooo, do you really want to add this statically to the iconv
> version?

Do I want to add that size? No, of course not, and that's why I'm
hoping (but not optimistic) that there may be a way to elide a good
part of the table based on patterns in the Hangul syllables, or the
possibility that the giant extensions are unimportant.

Do I want to give users who have large volumes of legacy text in
their languages stored in these encodings the same respect and
dignity as users of other legacy encodings we already support? Yes.

> Why can't we have all these character conversions in a
> state-driven machine which loads its information from an external
> configuration file? This way we can have any kind of conversion
> someone likes, by just adding the configuration file for the
> required Unicode to X and X to Unicode conversions.
This issue was discussed a long time ago, and the consensus among
users of static linking was that static linking is most valuable when
it makes the binary completely "portable" to arbitrary Linux systems
for the same cpu arch, without any dependency on having files in
particular locations on the system aside from the minimum required by
POSIX (things like /dev/null), the standard Linux /proc mountpoint,
and universal config files like /etc/resolv.conf (even that is not
necessary, BTW, if you have a DNS on localhost).

Having iconv not work without external character tables is
essentially a form of dynamic linking, and carries with it issues
like where the files are to be found (you can override that with an
environment variable, but that can't be permitted for setuid
binaries), what happens if the format needs to change and the format
on the target machine is not compatible with the libc version your
binary was built with, etc. This is also the main reason musl does
not support something like nss.

Another side benefit of the current implementation is that it's fully
self-contained and independent of any system facilities. It's pure C
and can be taken out of musl and dropped into any program on any C
implementation, including freestanding (non-hosted) implementations.
If it depended on the filesystem, adapting it for such usage would be
a lot more work.

> State-driven FSM interpreters are really small and fast, and may
> read their complete configuration from a file ... an architecture
> independent file, so we may have the same character conversion
> files for all architectures.

An FSM implementation would be several times larger than the
implementations in iconv.c.

It's possible that we could, at some time in the future, support
loading of user-defined character conversion files as an added
feature, but this should only be for really special-purpose things
like custom encodings used for games or obsolete systems (old Mac,
console games, IBM mainframes, etc.).
In terms of the criteria for what to include in musl itself, my idea
is that if you have a mail client or web browser based on iconv for
its character set handling, you should be able to read the bulk of
content in any language.

Rich
* Re: iconv Korean and Traditional Chinese research so far
From: Harald Becker @ 2013-08-05 1:53 UTC (permalink / raw)
Cc: musl, dalias

Hi Rich !

04-08-2013 20:49 Rich Felker <dalias@aerifal.cx>:
> Do I want to add that size? No, of course not, and that's why I'm
> hoping (but not optimistic) that there may be a way to elide a
> good part of the table based on patterns in the Hangul syllables,
> or the possibility that the giant extensions are unimportant.

I think there is a way for easy configuration. See my other mails;
they clarify what my intention is.

> Do I want to give users who have large volumes of legacy text in
> their languages stored in these encodings the same respect and
> dignity as users of other legacy encodings we already support?
> Yes.

Of course. I won't dictate to others which conversions they want to
use. I only hate to have plenty of conversion tables on my system
when I know I will never use such kinds of conversions ... but in
case I really need one, it can be added dynamically to the running
system.

> > Why can't we have all these character conversions in a
> > state-driven machine which loads its information from an
> > external configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
> This issue was discussed a long time ago, and the consensus among
> users of static linking was that static linking is most valuable
> when it makes the binary completely "portable" to arbitrary Linux
> systems for the same cpu arch, without any dependency on having
> files in particular locations on the system aside from the minimum
> required by POSIX (things like /dev/null), the standard Linux
> /proc mountpoint, and universal config files like /etc/resolv.conf
> (even that is not necessary, BTW, if you have a DNS on localhost).
> Having iconv not work without external character tables is
> essentially a form of dynamic linking, and carries with it issues
> like where the files are to be found (you can override that with
> an environment variable, but that can't be permitted for setuid
> binaries), what happens if the format needs to change and the
> format on the target machine is not compatible with the libc
> version your binary was built with, etc. This is also the main
> reason musl does not support something like nss.

I see the topic of self contained linking, and you are right that it
is required, but it is fully possible to have the best of both
worlds without much overhead. Writing iconv as a virtual machine
interpreter allows the conversion byte code programs to be
statically linked in. Those that are not linked in can be searched
for in the filesystem, and a simple configuration option may disable
file system search completely, for really small embedded operation.
But beside this, all conversions are the same and may be freely
copied between architectures, or linked statically into a user
program (just put the byte stream of the selected charsets into a
simple C array of bytes).

> Another side benefit of the current implementation is that it's
> fully self-contained and independent of any system facilities.
> It's pure C and can be taken out of musl and dropped into any
> program on any C implementation, including freestanding
> (non-hosted) implementations.
> If it depended on the filesystem, adapting it for such usage would
> be a lot more work.

The virtual machine shall be written in C; I've done this type of
programming many times. So the resulting code will compile with any
C compiler, and the byte code programs are just arrays of bytes,
independent of machine byte order. So you won't have any further
dependencies.

> An FSM implementation would be several times larger than the
> implementations in iconv.c.

A bit larger, yes ... but not so much if the virtual machine is
designed carefully, and it will not increase in size when more
charsets are added (only the size of the byte code programs is
added).

> It's possible that we could, at some time in the future, support
> loading of user-defined character conversion files as an added
> feature, but this should only be for really special-purpose things
> like custom encodings used for games or obsolete systems (old Mac,
> console games, IBM mainframes, etc.).

We can have it all, with not much overhead, and it is not only for
such special cases. I don't like to install musl on my systems with
Japanese, Chinese or Korean conversions, but in case I really need
them, I'm able to throw them in without much work ... and we can add
every character conversion on the fly, without a rebuild of the
library.

> In terms of the criteria for what to include in musl itself, my
> idea is that if you have a mail client or web browser based on
> iconv for its character set handling, you should be able to read
> the bulk of content in any language.

But what if you are building a mail client or web browser and want
to include the possibility of charset conversion while staying
small: including only the system relevant conversions, but not
limited to those? Any other conversion can then be added on the fly.

--
Harald
* Re: iconv Korean and Traditional Chinese research so far
From: Rich Felker @ 2013-08-05 3:39 UTC (permalink / raw)
To: musl

On Mon, Aug 05, 2013 at 03:53:12AM +0200, Harald Becker wrote:
> Hi Rich !
>
> 04-08-2013 20:49 Rich Felker <dalias@aerifal.cx>:
>
> > Do I want to add that size? No, of course not, and that's why
> > I'm hoping (but not optimistic) that there may be a way to
> > elide a good part of the table based on patterns in the Hangul
> > syllables, or the possibility that the giant extensions are
> > unimportant.
>
> I think there is a way for easy configuration. See my other mails;
> they clarify what my intention is.

I saw, and you're free to write such an iconv implementation if you
like, but it's not right for musl. Inventing elaborate mechanisms to
solve simple problems is the glibc way of doing things, not the musl
way. iconv is not something that needs to be extensible. There is a
finite set of legacy encodings that's relevant to the world, and
their relevance is going to go down and down with time, not up.

> > Do I want to give users who have large volumes of legacy text
> > in their languages stored in these encodings the same respect
> > and dignity as users of other legacy encodings we already
> > support? Yes.
>
> Of course. I won't dictate to others which conversions they want
> to use. I only hate to have plenty of conversion tables on my
> system when I know I will never use such kinds of conversions.

And your table for just Chinese is as large as all our tables
combined... I agree you can make iconv smaller than musl's in the
case where _no_ legacy DBCS are installed. But if you have just one,
you'll be just as large or larger than musl with them all. Just
compare the size of musl's tables to glibc's converters.
I've worked hard to make them as small as reasonably possible without doing hideous hacks like decompression into an in-memory buffer, which would actually increase bloat. > ... but > in case I really need, it can be added dynamically to the running > system. If you have root or want to set up nonstandard environment variables. > > This issue was discussed a long time ago and the consensus > > among users of static linking was that static linking is most > > valuable when it makes the binary completely "portable" to > > arbitrary Linux systems for the same cpu arch, without any > > dependency on having files in particular locations on the > > system aside from the minimum required by POSIX (things > > like /dev/null), the standard Linux /proc mountpoint, and > > universal config files like /etc/resolv.conf (even that is not > > necessary, BTW, if you have a DNS on localhost). Having iconv > > not work without external character tables is essentially a > > form of dynamic linking, and carries with it issues like where > > the files are to be found (you can override that with an > > environment variable, but that can't be permitted for setuid > > binaries), what happens if the format needs to change and the > > format on the target machine is not compatible with the libc > > version your binary was built with, etc. This is also the main > > reason musl does not support something like nss. > > I see the topic of self-contained linking, and you are right that > it is required, but it is fully possible to have the best of both > worlds without much overhead. Writing iconv as a virtual machine It's not the best of both worlds. It's essentially the same as dynamic linking. > interpreter allows statically linking in the conversion byte > code programs. At several times the size of the current code/tables, and after the user searches through the documentation to figure out how to do it.
> > Another side benefit of the current implementation is that it's > > fully self-contained and independent of any system facilities. > > It's pure C and can be taken out of musl and dropped in to any > > program on any C implementation, including freestanding > > (non-hosted) implementations. If it depended on the filesystem, > > adapting it for such usage would be a lot more work. > > The virtual machine shall be written in C, I've done such type of > programming many times. So resulting code will compile with any C > compiler, and byte code programs are just array of bytes, > independent of machine byte order. So you will have any further > dependencies. It's not just a matter of dropping in. You'd have path searches to modify or disable, build options to get the static tables turned on, and all of this stuff would have to be integrated with the build system for what you're dropping it into. Complexity is never the solution. Honestly, I would take a 1mb increase in binary size over this kind of complexity any day. Thankfully, we don't have to make such a tradeoff. > > A fsm implementation would be several times larger than the > > implementations in iconv.c. > > A bit larger, yes ... but not so much, if virtual machine gets > designed carefully, and it will not increase in size, when there > are more charsets get added (only size of byte code program > added). Charsets are not added. The time of charsets is over. It should have been over in 1992, when Pike and Thompson made them obsolete, but it's really over now. > > It's possible that we could, at some time in the future, > > support loading of user-defined character conversion files as > > an added feature, but this should only be for really > > special-purpose things like custom encodings used for games or > > obsolete systems (old Mac, console games, IBM mainframes, etc.). > > We can have it all, with not much overhead. And it is not only > for such special cases. 
I don't like to install musl on my > systems with Japanese, Chinese or Korean conversions, but in case > I really need, I'm able to throw them in, without much work. > > .... and we can add every character conversion on the fly, without > rebuild of the library. Maybe we should also include a bytecode interpreter for doing hostname lookups, since you might want to do something other than DNS or a hosts file. And a bytecode interpreter for user database lookups in place of passwd files. And a bytecode interpreter for adding new crypt() algorithms. And... > > In terms of the criteria for what to include in musl itself, my > > idea is that if you have a mail client or web browser based on > > iconv for its character set handling, you should be able to > > read the bulk of content in any language. > > If you are building a mail client or web browser, but what if you > want to include the possibility of charset conversion but stay at > small size, just including conversions for only system relevant > conversions, but not limiting to those. Any other conversion can > then be added on the fly. Then dynamic link it. If you want an extensible binary, you use dynamic linking. The main reason for static linking is when you want a binary whose behavior does not change with the runtime environment -- for example, for security purposes, for carrying around to other machines that don't have the same runtime environment, etc. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 3:39 ` Rich Felker @ 2013-08-05 7:53 ` Harald Becker 2013-08-05 8:24 ` Justin Cormack 2013-08-05 14:35 ` Rich Felker 0 siblings, 2 replies; 26+ messages in thread From: Harald Becker @ 2013-08-05 7:53 UTC (permalink / raw) Cc: musl, dalias Hi Rich ! > iconv is not something that needs to be extensible. There is a > finite set of legacy encodings that's relevant to the world, > and their relevance is going to go down and down with time, not > up. Oh! So you consider Japanese, Chinese, Korean, etc. languages relevant for programs sitting on my machines? How can you decide this? Why be so ignorant, trying to write a standards-conforming library and then picking out a list of charsets of your choice to support in iconv, neglecting the wishes and needs of musl users? ... or in other words, if you really are this ignorant and insist on including those charsets fixed in musl, musl is no longer for me :( ... I don't need to bring any part of mine into musl, but I don't consider a lib usable for my needs which includes several charset files in a static build and refuses to load seldom-used charset definitions externally in any way. > > > > Do I want to give users who have large volumes of legacy > > > text in their languages stored in these encodings the same > > > respect and dignity as users of other legacy encodings we > > > already support? Yes. > > > > Of course. I won't dictate others which conversions they want > > to use. I only hat to have plenty of conversion tables on my > > system when I really know I never use such kind of > > conversions. > > And your table for just Chinese is as large as all our tables > combined... How can you tell? I don't think so. Such conversion codes may be very compact.
Size is mainly required for translation tables, that is, when the code points of the charsets do not match Unicode character order; but you always need the space for those translations. The rest won't be much. > I agree you can make iconv smaller than musl's in the case > where _no_ legacy DBCS are installed. But if you have just one, > you'll be just as large or larger than musl with them all. ... musl with them all? I don't consider them smaller than an optimized byte code interpreter ... not when you are going to include DBCS charsets fixed into musl. At least if you do all the required translations. > compare the size of musl's tables to glibc's converters. I've > worked hard to make them as small as reasonably possible > without doing hideous hacks like decompression into an > in-memory buffer, which would actually increase bloat. Are you now going to build a lib for startup purposes and embedded systems only, or are you trying to write a general-purpose library? Including all those definitions in a static build is definitely not the way I will ever like. This may be done for some special situations and selected charsets, but not for a general-purpose library claiming to get wide usage. > If you have root or want to setup nonstandard environment > variables. What about a charset search path including something like "~/.local/share/charset"? This would allow installing charset files in the user's directory. > > interpreter allows statically linking in the conversion byte > > code programs. > > At several times the size of the current code/tables, and after > the user searches through the documentation to figure out how > to do it. You really intend to include all those code tables statically into musl? I won't include much more than some standard sets. Why don't you want to load the charset definitions as they are required?
On one hand you say "use dietlibc" if you need small static programs, and on the other hand you want to include many charset definitions in a static build to avoid dynamic loading of tables, required only on embedded systems. So what's the purpose of musl? I don't think you are being consistent here. > It's not just a matter of dropping in. You'd have path searches > to modify or disable, build options to get the static tables > turned on, and all of this stuff would have to be integrated > with the build system for what you're dropping it into. I don't see the required complexity. In fact I won't have a lib that includes several charset definitions in a static build. I really like to have a directory with definition files for those charsets, and I don't see the complexity for this that you proclaim. Inclusion in a static build is no more than selection of the charsets you want to be included statically. This selection is always required, or you include all files, which I definitely reject. > Complexity is never the solution. Honestly, I would take a 1mb > increase in binary size over this kind of complexity any day. > Thankfully, we don't have to make such a tradeoff. The only complexity which we have here is the complexity of charset translation. The rest is relatively simple. > Charsets are not added. The time of charsets is over. It should > have been over in 1992, when Pike and Thompson made them > obsolete, but it's really over now. So why are you adding Japanese, Chinese and Korean charsets to an iconv conversion in musl? Why not just use UTF-8? Whenever you use iconv you want the flexibility to do all required charset conversions. Which means you need to statically link in many charset definitions, or you need to dynamically load what is required. > Then dynamic link it. If you want an extensible binary, you use > dynamic linking. Dynamic linking of a mail client, ok, and where do the charset definition files go? Are they all packed into your libc.so?
That is a very big file! Why do I need to have Asian language definitions on my disk when I do not want them? It is your decision, but please state clearly for what purpose you are building musl. Here it looks like you are mixing things and stepping in a direction I will never like. -- Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 7:53 ` Harald Becker @ 2013-08-05 8:24 ` Justin Cormack 2013-08-05 14:43 ` Rich Felker 2013-08-05 14:35 ` Rich Felker 1 sibling, 1 reply; 26+ messages in thread From: Justin Cormack @ 2013-08-05 8:24 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 6597 bytes --] On 5 Aug 2013 08:53, "Harald Becker" <ralda@gmx.de> wrote: > > Hi Rich ! > > > iconv is not something that needs to be extensible. There is a > > finite set of legacy encodings that's relevant to the world, > > and their relevance is going to go down and down with time, not > > up. > > Oh! So you consider Japanese, Chinese, Korean, etc. languages > relevant for programs sitting on my machines? How can you decide > this? Why being so ignorant and trying to write an standard > conform library and then pick out a list of char sets of your > choice which may be possible on iconv, neglecting wishes and > need of any musl user. > > ... or in other words, if you really be this ignorant and > insist on including those charsets fixed in musl, musl is never > more for me :( ... I don't need to bring in any part of mine into > musl, but I don't consider a lib usable for my needs, which > include several char set files in statical build and neglects to > load seldom used charset definitions from extern in any way. They are not going to be "fixed" just don't build them. It is not hard with Musl. Just add this into your build script. One of the nice features of Musl is that it appeals to a broader audience than just "embedded" so it is always going to have stuff you can cut out if you want absolute minimalism but this means it will get wider usage. Adding external files has many disadvantages to other people. If you don't want these conversions external files do not help you. Making software for more than one person involves compromises so please calm down a bit. 
Use your own embedded build with the parts you don't need omitted. Justin > > > > > > Do I want to give users who have large volumes of legacy > > > > text in their languages stored in these encodings the same > > > > respect and dignity as users of other legacy encodings we > > > > already support? Yes. > > > > > > Of course. I won't dictate others which conversions they want > > > to use. I only hat to have plenty of conversion tables on my > > > system when I really know I never use such kind of > > > conversions. > > > > And your table for just Chinese is as large as all our tables > > combined... > > How can you tell this. I don't think so. Such conversion codes > may be very compact. Size is mainly required for translation > tables, that is when code points of the char sets does not match > Unicode character order, but you always need the space for those > translations. The rest won't be much. > > > I agree you can make iconv smaller than musl's in the case > > where _no_ legacy DBCS are installed. But if you have just one, > > you'll be just as large or larger than musl with them all. > > ... musl with them all? I don't consider them smaller than an > optimized byte code interpreter ... not when you are going to > include DBCS char sets fixed into musl. At least if you do all > the required translations. > > > compare the size of musl's tables to glibc's converters. I've > > worked hard to make them as small as reasonably possible > > without doing hideous hacks like decompression into an > > in-memory buffer, which would actually increase bloat. > > Are you now going to build a lib for startup purpose and embedded > systems only or are you trying to write a general purpose > library? Including all those definitions in a statical build is > definitely not the way I will ever like. This may be done for > some special situations and selected char sets, but not for a > general purpose library, claiming to get a wide usage. 
> > > If you have root or want to setup nonstandard environment > > variables. > > What about a charset searchpath including something like > "~/.local/share/charset". This would allow to install charset > files in the users directory. > > > > interpreter allows to statical link in the conversion byte > > > code programs. > > > > At several times the size of the current code/tables, and after > > the user searches through the documentation to figure out how > > to do it. > > You definitely consider to include all those code tables > statically into musl? I won't include much more than some > standard sets. Why don't you want to load the charset definitions > as they are required? > > On one hand you say "use dietlibc" if you need small statical > programs and on the other hand you want to include many charset > definitions into a statical build to avoid dynamic loading of > tables, required only on embedded systems. > > So what's the purpose of musl? I don't think you stay right here. > > > It's not just a matter of dropping in. You'd have path searches > > to modify or disable, build options to get the static tables > > turned on, and all of this stuff would have to be integrated > > with the build system for what you're dropping it into. > > I don't see the required complexity. In fact I won't have a lib > that includes several charset definitions in a statical build. I > really like to have a directory with definition files for those > char sets and don't see the complexity for this you proclamate. > > Inclusion in statical build is not more than selection of the > charsets you want o be included statically. This selection is > always required or you include all files , which I definitly > neglect. > > > Complexity is never the solution. Honestly, I would take a 1mb > > increase in binary size over this kind of complexity any day. > > Thankfully, we don't have to make such a tradeoff. 
> > The only complexity which we has here is the complexity of > charset translation. The rest is relatively simple. > > > Charsets are not added. The time of charsets is over. It should > > have been over in 1992, when Pike and Thompson made them > > obsolete, but it's really over now. > > So why are you adding Japanese, Chinese and Korean charsets to an > iconv conversion in musl? Why not just using UTF-8? Whenever you > use iconv you want the flexibility to do all required charset > conversions. Which means you need to statically link in many > charset definitions or you need to dynamically load what is > required. > > > Then dynamic link it. If you want an extensible binary, you use > > dynamic linking. > > Dynamic linking of mail client, ok and where go the charset > definition files? Are they all packed into your libc.so? That is > a very big file? Why do I need to have Asian language definition > on my disk, when I do not want? > > It is your decision, but please state clear what purpose you are > building musl. Here it looks you are mixing things and steping in > a direction I will never like. > > -- > Rich [-- Attachment #2: Type: text/html, Size: 8041 bytes --] ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 8:24 ` Justin Cormack @ 2013-08-05 14:43 ` Rich Felker 0 siblings, 0 replies; 26+ messages in thread From: Rich Felker @ 2013-08-05 14:43 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 09:24:37AM +0100, Justin Cormack wrote: > They are not going to be "fixed" just don't build them. It is not hard with > Musl. Just add this into your build script. Indeed. My intent is for it to be fully-functional-as-shipped. If somebody needs to cripple certain interfaces to meet extreme size requirements, that's an ok local modification, and it might even be acceptable as a configure option if enough people legitimately request it. > One of the nice features of Musl is that it appeals to a broader audience > than just "embedded" so it is always going to have stuff you can cut out if > you want absolute minimalism but this means it will get wider usage. Cutting out math/*, complex/*, and most of crypt/* would save at least as much space as iconv, and there are plenty of places these aren't needed either. It's not for me to decide which options you can omit. Thankfully, due to musl's correct handling of static linking, you usually don't have to think about it either. You just static link and get only what you need. > Adding external files has many disadvantages to other people. If you don't > want these conversions external files do not help you. External files also do not make things work "by default". They only work if musl has been installed system-wide according to our directions (which not everybody will follow) or if the user has done the research to figure out how to work around it not being installed system-wide. > Making software for more than one person involves compromises so please > calm down a bit. Use your own embedded build with the parts you don't need > omitted. Exactly. Where musl excels here is by not _forcing_ you to use iconv.
I take great care not to force linking of components you might not want to see in your output binary size, and for TLS, which unfortunately was misdesigned in such a way that the linker can't see if TLS is used or not for the purpose of deciding whether to link the TLS init code, I went to great lengths both to minimize the size of __init_tls.o and to make it easy, as a local customization, to omit this module. But as an analogy, I would not have even considered asking musl users who need TLS to add special CFLAGS, libraries, etc. when building programs. That's an unreasonable burden and it's broken because it does not "work by default". Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-05 7:53 ` Harald Becker 2013-08-05 8:24 ` Justin Cormack @ 2013-08-05 14:35 ` Rich Felker 1 sibling, 0 replies; 26+ messages in thread From: Rich Felker @ 2013-08-05 14:35 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 09:53:32AM +0200, Harald Becker wrote: > Hi Rich ! > > > iconv is not something that needs to be extensible. There is a > > finite set of legacy encodings that's relevant to the world, > > and their relevance is going to go down and down with time, not > > up. > > Oh! So you consider Japanese, Chinese, Korean, etc. languages > relevant for programs sitting on my machines? How can you decide I don't decide what's relevant for you. Rather, I don't have the authority to declare it irrelevant-by-default. This is true even for things like crypt algorithms (does anybody really want to use md5??) but especially for anything that would preclude somebody from being able to receive data in their native language. Simple multilingual support via UTF-8 with conversion from legacy data has been near top priority, if not top, since the conception of musl. If history has shown us anything, it's that universal support for all languages must be default and turning off some support to save space (which is rarely if ever actually needed) needs to be a conscious decision. I'm no Apple fan by any means, but just look at the situation on iOS: you can turn on a new iPhone or iPad and read data in any language (including having the relevant fonts!) and even add a keyboard and type in almost any language, without having to buy a special localized version or install add-ons. This is very different from the situation on Android right now. musl's intended applicability is broad. 
From industrial control to set-top boxes, in-car entertainment, initramfs images for desktop machines, phones, tablets, plug computers that run your private home or office webmail server, full desktops, VE LAMP stacks, hosts for VEs, etc. Some of these usages have a real need for human-language text; others don't. But if we have the power to make it such that, if someone uses musl to implement a plug computer for webmail, it naturally supports all languages unless the maker of the device goes and actively rips that support out, then we have a responsibility to do so. Or, said differently, it's OUR FAULT for making broken-by-default software if language support is missing unless you go to the effort of learning musl-specific ways to enable it. > this? Why be so ignorant, trying to write a standards-conforming > library and then picking out a list of charsets of your choice > to support in iconv, neglecting the wishes and needs of musl > users? If I were to just accept your demands, it would essentially mean: (1) discarding the opinions of everybody else who discussed this issue in the past and decided that static linking should mean real static binaries that work the same without needing extra files in the filesystem. (2) discarding the informed decisions I made based on said discussions. > ... or in other words, if you really are this ignorant and > insist on including those charsets fixed in musl, musl is no > longer for me :( ... I don't need to bring any part of mine into > musl, but I don't consider a lib usable for my needs which > includes several charset files in a static build and refuses to > load seldom-used charset definitions externally in any way. Name the extra "seldom used charset definitions" you're interested in. They're probably already supported. We are not discussing adding some new giant subsystem to musl.
We are discussing adding the last two missing major legacy charsets to an existing framework that's existed for a long time. > > > > Do I want to give users who have large volumes of legacy > > > > text in their languages stored in these encodings the same > > > > respect and dignity as users of other legacy encodings we > > > > already support? Yes. > > > > > > Of course. I won't dictate others which conversions they want > > > to use. I only hat to have plenty of conversion tables on my > > > system when I really know I never use such kind of > > > conversions. > > > > And your table for just Chinese is as large as all our tables > > combined... > > How can you tell this. I don't think so. You're welcome to implement it and see. Thanks to the way static linking works, if you add -lyouriconv when static linking, the iconv in musl will be completely omitted from the binary and yours will be used instead. Of course the iconv in musl will be completely omitted anyway except in the small number of programs that actually use iconv. This is not glibc where stdio and locale depend on iconv. iconv is purely iconv. > Such conversion codes > may be very compact. Size is mainly required for translation > tables, that is when code points of the char sets does not match > Unicode character order, but you always need the space for those > translations. The rest won't be much. That's all the size. The VAST majority of the table size is for 4 major character encoding families, those based on: - JIS 0208 - GB 18030 - KS X 1001 - Big5 As for legacy 8-bit encodings, musl's approach to them is also more efficient than you could easily be with a state machine. The fact that the number of codepoints that ever appear in an 8-bit encoding is less than 1024 is used to store the mappings as 10-bit-per-entry packed arrays of indices into the legacy_chars table. This reduces the marginal cost of individual 8bit encodings by 25% (versus 16-bit entries). 
The ASCII range and any span upward into the high range that maps directly to Unicode codepoints is also elided from the table (which reduces ISO-8859-* by another 62.5%). In short, what we have is about the smallest possible representation you can get without applying LZMA or something (and thereby needing all the code to decompress and dirty pages to store the decompressed version). It's hard to beat. By the way, if you really want to save the space they take, you could just delete this email thread from your mail folder. It's larger than musl's iconv already. :-) > > I agree you can make iconv smaller than musl's in the case > > where _no_ legacy DBCS are installed. But if you have just one, > > you'll be just as large or larger than musl with them all. > > .... musl with them all? I don't consider them smaller than an > optimized byte code interpreter ... not when you are going to > include DBCS char sets fixed into musl. At least if you do all > the required translations. I may have been exaggerating a little bit, but I doubt you can get your bytecode GB18030 support smaller than about 110k once you count the bytecode and the interpreter binary. I'm even more doubtful that you can get it smaller than the current 71k in musl. > > compare the size of musl's tables to glibc's converters. I've > > worked hard to make them as small as reasonably possible > > without doing hideous hacks like decompression into an > > in-memory buffer, which would actually increase bloat. > > Are you now going to build a lib for startup purpose and embedded > systems only or are you trying to write a general purpose > library? General-purpose. Have you not read the website? Originally in the 1990s, Linux-based systems used a fork of the GNU C library (glibc) version 1, which existed in various versions (libc4, libc5). Later, distributions adopted the more mature version 2 of glibc, and denoted it libc6. 
Since then, other specialized C library implementations such as uClibc and dietlibc have emerged as well. musl is a new general-purpose implementation of the C library. It is lightweight, fast, simple, free, and aims to be correct in the sense of standards-conformance and safety. If you're using it for startup purposes or embedded systems that don't communicate with humans in human language, you won't be running applications that call iconv() and thus it's irrelevant. > On one hand you say "use dietlibc" if you need small statical > programs and on the other hand you want to include many charset > definitions into a statical build to avoid dynamic loading of > tables, required only on embedded systems. Where did I say "use dietlibc"? If I did (I don't really remember) it was not a serious recommendation but a sarcastic remark to make a point that musl is not about being "smallest-at-all-costs" (and thereby broken) like dietlibc is. > > have been over in 1992, when Pike and Thompson made them > > obsolete, but it's really over now. > > So why are you adding Japanese, Chinese and Korean charsets to an > iconv conversion in musl? Why not just using UTF-8? Whenever you > use iconv you want the flexibility to do all required charset > conversions. Which means you need to statically link in many > charset definitions or you need to dynamically load what is > required. The time of creating charsets is over. That does not magically make the data created in those charsets in the past go away or convert itself to UTF-8. It doesn't even magically stop people from making new data in those charsets. All it means is that governments, vendors, etc. have stopped the madness of making new charsets. > > Then dynamic link it. If you want an extensible binary, you use > > dynamic linking. > > Dynamic linking of mail client, ok and where go the charset > definition files? Are they all packed into your libc.so? That is > a very big file? 
Why do I need to have Asian language definitions > on my disk when I do not want them? Because any other solution would be larger, would defeat the purpose of static linking, and would contribute to the problem of poor multilingual support. Why are you upset about these tables and not other tables like crypto sboxes, wcwidth, character classes, bits of 2/pi and pi/2, etc.? By the way, math/*.o are also fairly large, on the same order of magnitude as iconv; would you also suggest we move it all out to bytecode loaded at runtime even in static binaries? > It is your decision, but please state clearly for what purpose > you are building musl. Here it looks like you are mixing things > and stepping in a direction I will never like. This has all been documented all along. I'm sorry you don't understand the goals of the project. Perhaps your misunderstanding is about what "general purpose" means. It does not mean we omit anything that could offend anyone by wasting a few bytes on their hard drive. It means we don't cut corners that break important usage cases. Having a complete iconv linked whenever you link a program using iconv() does not break your usage case unless you have less than 100k of disk/ssd/rom storage to spare, and in that case, you probably shouldn't be using iconv. If anyone ever does have a practical difficulty because of this, rather than theoretical complaints based on anglocentrism, eurocentrism, and/or xenophobia, I am not entirely opposed to making a build option to omit iconv tables, but it has to be well-motivated. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: iconv Korean and Traditional Chinese research so far 2013-08-04 16:51 iconv Korean and Traditional Chinese research so far Rich Felker 2013-08-04 22:39 ` Harald Becker @ 2013-08-05 0:46 ` Harald Becker 2013-08-05 5:00 ` Rich Felker 2013-08-05 8:28 ` Roy 3 siblings, 0 replies; 26+ messages in thread From: Harald Becker @ 2013-08-05 0:46 UTC (permalink / raw) Cc: musl, dalias Hi Rich, in addition to my previous message, to clarify some things: 04-08-2013 12:51 Rich Felker <dalias@aerifal.cx>: > Worst-case, adding Korean and Traditional Chinese tables will > roughly double the size of iconv.o to around 150k. This will > noticeably enlarge libc.so, but will make no difference to > static-linked programs except those using iconv. I'm hoping we > can make these additions less expensive, but I don't see a good > way yet. I would write iconv as a virtual machine interpreter for a very simple byte code machine. The byte code (program) of the virtual machine is just an array of unsigned bytes, and the virtual machine contains only the instructions to read the next byte and assemble a Unicode value, or to receive a Unicode value and produce multi-byte character output. The virtual machine code itself works like a finite state machine to handle multi-byte character sets. That way iconv consists of a small byte code interpreter that builds the virtual machine. It then maps in the byte code from an external file for any required character set. This byte code from the external file consists of virtual machine instructions and conversion tables. As this virtual machine shall be optimized for conversion purposes, conversion requires interpretation of only a few virtual instructions per converted character (for simple character sets; big ones may need a few more instructions). This operation is usually very fast, as not much data is involved and the instructions are highly optimized for the conversion operation.
The virtual machine works with a data space of only a few bytes (less than 256), where some bytes need to be preserved from one conversion call to the next. That is, conversion needs a conversion context of a few bytes (8..16). Independently of any character set conversion you want to add, you only need a single byte code interpreter for iconv, which will not increase in size. Only the external byte code / conversion table for the charsets may vary in size. Simple charsets, like the Latin ones, consist of only a few bytes of byte code; big charsets like Japanese, Chinese and Korean need some more byte code and maybe some bigger translation tables ... but those tables are only loaded if iconv needs to access such a charset. iconv itself doesn't need to handle a table of available charsets; it only converts the charset name into a filename and opens the corresponding charset translation file. In the charset file, a header and version check shall handle possible installation conflicts. For any conversion request the virtual machine interpreter runs through the byte code of the requested charset and returns the conversion result. As the virtual machine shall not contain operations that can violate the remainder of the system, this shall not break system security. At worst the byte code is so misbehaved that it runs forever, without producing an error or any output, so the machine just hangs in an infinite loop during conversion until the process is terminated (a simple counter may limit the number of executed instructions and bail out in case of such looping). > At some point, especially if the cost is not reduced, I will > probably add build-time options to exclude a configurable > subset of the supported character encodings. This would not be > extremely fine-grained, and the choices to exclude would > probably be just: Japanese, Simplified Chinese, Traditional > Chinese, and Korean.
Legacy 8-bit might also be an option but > these are so small I can't think of cases where it would be > beneficial to omit them (5k for the tables on top of the 2k of > actual code in iconv). Perhaps if there are cases where iconv > is needed purely for conversion between different Unicode > forms, but no legacy charsets, on tiny embedded devices, > dropping the 8-bit tables and all of the support code could be > useful; the resulting iconv would be around 1k, I think. You may skip all this if iconv is constructed as a virtual machine interpreter and all character conversions are loaded from an external file. As a fallback, the library may compile in the byte code for some small charset conversions, like ASCII, Latin-1, UTF-8. All other charset conversions are loaded from external resources, which may be installed or not depending on the admin's decision, and can simply be added later if required. -- Harald ^ permalink raw reply [flat|nested] 26+ messages in thread
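The interpreter scheme described in this message could be sketched roughly as below. This is a toy illustration only: the opcodes, table layout, and function names are all invented here (the proposal never specifies them), and a real design would need stateful multi-byte handling and the instruction-count limit mentioned above.

```python
# Toy sketch of a bytecode-driven charset converter.  The "program"
# is a list of opcodes; the VM walks it once per input byte.
OP_ASCII = 0  # emit bytes < 0x80 unchanged
OP_TABLE = 1  # map bytes >= 0x80 through a 128-entry codepoint table

def vm_to_unicode(bytecode, table, data):
    """Run the conversion 'program' over a byte string, returning str."""
    out = []
    for b in data:
        for op in bytecode:
            if op == OP_ASCII and b < 0x80:
                out.append(chr(b))
                break
            if op == OP_TABLE and b >= 0x80:
                out.append(chr(table[b - 0x80]))
                break
    return "".join(out)

# For an ISO-8859-1-style charset the table is just the identity mapping;
# real charsets would ship a nontrivial table in the external file.
latin1_table = list(range(0x80, 0x100))
```

With this structure, only `vm_to_unicode` lives in libc; the opcode list and table would come from the external per-charset file, which is the point of the proposal.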
* Re: iconv Korean and Traditional Chinese research so far 2013-08-04 16:51 iconv Korean and Traditional Chinese research so far Rich Felker 2013-08-04 22:39 ` Harald Becker 2013-08-05 0:46 ` Harald Becker @ 2013-08-05 5:00 ` Rich Felker 2013-08-05 8:28 ` Roy 3 siblings, 0 replies; 26+ messages in thread From: Rich Felker @ 2013-08-05 5:00 UTC (permalink / raw) To: musl On Sun, Aug 04, 2013 at 12:51:52PM -0400, Rich Felker wrote: > Both of these have various minor extensions, but the main extensions > of any relevance seem to be: > > Korean: > CP949 > Lead byte range is extended to 81-FD (125) > Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126) > 44500 bytes table space > > Traditional Chinese: > HKSCS (CP951) > Lead byte range is extended to 88-FE (119) > 1651 characters outside BMP > 37366 bytes table space for 16-bit mapping table, plus extra mapping > needed for characters outside BMP > > The big remaining questions are: > > 1. How important are these extensions? I would guess the answer is > "fairly important", especially for HKSCS where I believe the > additional characters are needed for encoding Cantonese words, but > it's less clear to me whether the Korean extensions are useful (they > seem to mainly be for the sake of completeness representing most/all > possible theoretical syllables that don't actually occur in words, but > this may be a naive misunderstanding on my part). For what it's worth, there is no IANA charset registration for any supplement to Korean. See the table here: http://www.iana.org/assignments/character-sets/character-sets.xhtml The only entries for Korean are ISO-2022-KR and EUC-KR. Big5-HKSCS however is registered. This matches my intuition that, of the two, HKSCS would be more important to real-world usage than Korean extensions. If we were to omit CP949 and just go with KS X 1001, but include HKSCS, the total size (minus a minimal amount of code needed) would be 17484+37366 = 54850.
With both supported, it would be 44500+37366 = 81866. With just KS X 1001 and base Big5, it would be 17484+27946 = 45430. Given that HKSCS is a standard, registered MIME charset, that the cost is only about 10k, and that it seems necessary for real-world usage in Hong Kong, I think it's pretty obvious that we should support it. So I think the question we're left with is whether the CP949 (MS encoding) extension for Korean is important to support. The cost is roughly 27k (44500-17484). I'm going to keep doing research to see if identifying the characters added in it sheds any light on whether there are important additions. Obviously I would like to be able to exclude it but I don't want this decision to be made unfairly based on my bias when it comes to bloat. :) Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
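The table-space figures traded back and forth in this thread all follow from one rule: each DBCS grid cell stores one 16-bit Unicode value, so table space is lead bytes times tail bytes times two. A quick check, using the grid dimensions quoted in the messages above:

```python
# table space = lead_byte_count * tail_byte_count * 2 bytes per entry.
# All dimensions are taken directly from the figures in the thread.
def table_bytes(lead, tail):
    return lead * tail * 2

ksx1001 = table_bytes(93, 94)              # KS X 1001: A1-FD x A1-FE
big5    = table_bytes(89, 63 + 94)         # Big5: A1-F9 x 40-7E,A1-FE
cp949   = table_bytes(125, 26 + 26 + 126)  # CP949: 81-FD x 41-5A,61-7A,81-FE
hkscs   = table_bytes(119, 63 + 94)        # HKSCS: lead extended to 88-FE
```

Running this reproduces the 17484, 27946, 44500, and 37366 figures exactly, and the sums (54850, 81866, 45430) follow.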
* Re: iconv Korean and Traditional Chinese research so far 2013-08-04 16:51 iconv Korean and Traditional Chinese research so far Rich Felker ` (2 preceding siblings ...) 2013-08-05 5:00 ` Rich Felker @ 2013-08-05 8:28 ` Roy 2013-08-05 15:43 ` Rich Felker 2013-08-05 19:12 ` Rich Felker 3 siblings, 2 replies; 26+ messages in thread From: Roy @ 2013-08-05 8:28 UTC (permalink / raw) To: musl Since I'm a Traditional Chinese and Japanese legacy encoding user, I think I can say something here. Mon, 05 Aug 2013 00:51:52 +0800, Rich Felker <dalias@aerifal.cx> wrote: > OK, so here's what I've found so far. Both legacy Korean and legacy > Traditional Chinese encodings have essentially a single base character > set: > > > Traditional Chinese: > Big5 (CP950) > 89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE) > All characters in BMP > 27946 bytes table space > > Both of these have various minor extensions, but the main extensions > of any relevance seem to be: > > Traditional Chinese: > HKSCS (CP951) > Lead byte range is extended to 88-FE (119) > 1651 characters outside BMP > 37366 bytes table space for 16-bit mapping table, plus extra mapping > needed for characters outside BMP > There is another Big5 extension called Big5-UAO, which is used on the world's largest telnet-based BBS, "ptt.cc". It has two tables: one for Big5-UAO to Unicode, the other for Unicode to Big5-UAO. http://moztw.org/docs/big5/table/uao250-b2u.txt http://moztw.org/docs/big5/table/uao250-u2b.txt It extends the DBCS lead byte range down to 0x81. > The big remaining questions are: > > 1. How important are these extensions?
I would guess the answer is > "fairly important", especially for HKSCS where I believe the > additional characters are needed for encoding Cantonese words, but > it's less clear to me whether the Korean extensions are useful (they > seem to mainly be for the sake of completeness representing most/all > possible theoretical syllables that don't actually occur in words, but > this may be a naive misunderstanding on my part). For Big5-UAO, it contains Japanese and Simplified Chinese characters which do not exist in the original MS-CP950 implementation. > > 2. Are there patterns to exploit? For Korean, ALL of the Hangul > characters are actually combinations of several base letters. Unicode > encodes them all sequentially in a pattern where the conversion to > their constituent letters is purely algorithmic, but there seems to > be no clean pattern in the legacy encodings, as the encodings started > out just encoding the "important" ones then adding less important > combinations in separate ranges. In EUC-KR (MS-CP949), there are Hanja characters (i.e. Kanji characters in Japanese) and Japanese Katakana/Hiragana besides Hangul characters. > > Worst-case, adding Korean and Traditional Chinese tables will roughly > double the size of iconv.o to around 150k. This will noticeably enlarge > libc.so, but will make no difference to static-linked programs except > those using iconv. I'm hoping we can make these additions less > expensive, but I don't see a good way yet. For static linking, can we have conditional linking like Qt does? In Qt static linking, Q_IMPORT_PLUGIN is used to include the CJK codec tables:

#ifndef QT_SHARED
#include <QtPlugin>
Q_IMPORT_PLUGIN(qcncodecs)
Q_IMPORT_PLUGIN(qjpcodecs)
Q_IMPORT_PLUGIN(qkrcodecs)
Q_IMPORT_PLUGIN(qtwcodecs)
#endif

> > At some point, especially if the cost is not reduced, I will probably > add build-time options to exclude a configurable subset of the > supported character encodings.
This would not be extremely > fine-grained, and the choices to exclude would probably be just: > Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy > 8-bit might also be an option but these are so small I can't think of > cases where it would be beneficial to omit them (5k for the tables on > top of the 2k of actual code in iconv). Perhaps if there are cases > where iconv is needed purely for conversion between different Unicode > forms, but no legacy charsets, on tiny embedded devices, dropping the > 8-bit tables and all of the support code could be useful; the > resulting iconv would be around 1k, I think. > > Rich > HTH, Roy ^ permalink raw reply [flat|nested] 26+ messages in thread
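The "purely algorithmic" Unicode arrangement of Hangul that this exchange refers to is the standard conjoining-jamo arithmetic from the Unicode core specification: the precomposed syllables occupy U+AC00..U+D7A3 in lexicographic (lead, vowel, tail) order, so decomposition is pure integer math with no table at all. A minimal sketch:

```python
# Standard Unicode Hangul syllable decomposition (Unicode core spec,
# "Conjoining Jamo Behavior").  No lookup table is needed.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28  # 19 leads x 21 vowels x 28 tails = 11172

def decompose_hangul(ch):
    """Split one precomposed Hangul syllable into its jamo codepoints."""
    i = ord(ch) - S_BASE
    l = L_BASE + i // (V_COUNT * T_COUNT)
    v = V_BASE + (i % (V_COUNT * T_COUNT)) // T_COUNT
    t = T_BASE + i % T_COUNT
    return (l, v) if t == T_BASE else (l, v, t)
```

For example, U+D55C (the syllable "han") decomposes to lead U+1112, vowel U+1161, tail U+11AB. The legacy encodings discussed above follow no such arithmetic, which is exactly why they need tables.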
* Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-05 8:28 ` Roy @ 2013-08-05 15:43 ` Rich Felker 2013-08-05 17:31 ` Rich Felker 2013-08-05 19:12 ` Rich Felker 1 sibling, 1 reply; 26+ messages in thread From: Rich Felker @ 2013-08-05 15:43 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 04:28:32PM +0800, Roy wrote: > Since I'm a Traditional Chinese and Japanese legacy encoding user, I > think I can say something here. Great, thanks for joining in with some constructive input! :) > >Traditional Chinese: > >HKSCS (CP951) > >Lead byte range is extended to 88-FE (119) > >1651 characters outside BMP > >37366 bytes table space for 16-bit mapping table, plus extra mapping > >needed for characters outside BMP > > There is another Big5 extension called Big5-UAO, which is being used > in world's largest telnet-based BBS called "ptt.cc". > > It has two tables, one for Big5-UAO to Unicode, another one is > Unicode to Big5-UAO. > http://moztw.org/docs/big5/table/uao250-b2u.txt > http://moztw.org/docs/big5/table/uao250-u2b.txt > > Which extends DBCS lead byte to 0x81. Is it a superset of HKSCS or does it assign different characters to the range covered by HKSCS? > In EUC-KR (MS-CP949), there is Hanja characters (i.e. Kanji > characters in Japanese) and Japanese Katakana/Hiragana besides of > Hangul characters. Yes, I'm aware of these. However, it looks to me like the only characters outside the standard 94x94 grid zone are Hangul syllables, and they appear in codepoint order. If so, even if there's not a good pattern to where they're located, merely knowing that the ones that are missing from the 94x94 grid are placed in order in the expanded space is sufficient to perform algorithmic (albeit inefficient) conversion. Does this sound correct? > >Worst-case, adding Korean and Traditional Chinese tables will roughly > >double the size of iconv.o to around 150k. 
This will noticeably enlarge > >libc.so, but will make no difference to static-linked programs except > >those using iconv. I'm hoping we can make these additions less > >expensive, but I don't see a good way yet. > > For static linking, can we have conditional linking like QT does? My feeling is that it's a tradeoff, and probably has more pros than cons. Unlike QT, musl's iconv is extremely small. Even with all the above, the size of iconv.o will be under 130k, maybe closer to 110k. If you actually use iconv in your program, this is a small price to pay for having it fully functional. On the other hand, if linking it is conditional, you have to consider who makes the decision, and when. If it's at link time for each application, that's probably too much of a musl-specific mechanism. If it's at build time for musl, then is it your device vendor deciding for you what languages you need? One of the biggest headaches of uClibc-based systems is finding that the system libc was built with important options you need turned off and that you need to hack in a replacement to get something working... I think the cost of getting stuck with broken binaries where charsets were omitted is sufficiently greater than the cost of adding a few tens of kb to static binaries using iconv, that we should only consider a build time option if embedded users are actively reporting size problems. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-05 15:43 ` Rich Felker @ 2013-08-05 17:31 ` Rich Felker 0 siblings, 0 replies; 26+ messages in thread From: Rich Felker @ 2013-08-05 17:31 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 11:43:45AM -0400, Rich Felker wrote: > > In EUC-KR (MS-CP949), there is Hanja characters (i.e. Kanji > > characters in Japanese) and Japanese Katakana/Hiragana besides of > > Hangul characters. > > Yes, I'm aware of these. However, it looks to me like the only > characters outside the standard 94x94 grid zone are Hangul syllables, > and they appear in codepoint order. If so, even if there's not a good > pattern to where they're located, merely knowing that the ones that > are missing from the 94x94 grid are placed in order in the expanded > space is sufficient to perform algorithmic (albeit inefficient) > conversion. Does this sound correct? I've verified that this is correct and committed an implementation of Korean based on this principle, which I basically copied from my current implementation of GB18030's support for arbitrary Unicode codepoints. It has not been heavily tested but I did test it casually with all the important boundary values and it seems correct. Tests should probably be added to the test suite. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
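The principle Rich verified here (extension syllables appearing in Unicode codepoint order in the extended code space) can be sketched as follows. The tiny grid table and the function name are invented for illustration; the real KS X 1001 grid encodes 2350 syllables, and the real implementation also has to map the resulting index onto the irregular CP949 lead/tail byte ranges.

```python
import bisect

# Hypothetical sketch of ordered-gap conversion: syllables absent from
# the base 94x94 grid appear, in Unicode codepoint order, in the
# extended (CP949) code space, so the k-th missing syllable occupies
# the k-th extended slot and no per-character table is needed for the
# extension.  GRID below is toy data standing in for the real table.
GRID = [0xAC00, 0xAC01, 0xAC04]  # syllables the base grid encodes (sorted)
S_BASE = 0xAC00                  # first Hangul syllable codepoint

def extended_index(cp):
    """0-based slot of cp among syllables missing from the base grid."""
    in_grid_before = bisect.bisect_left(GRID, cp)  # grid syllables < cp
    return (cp - S_BASE) - in_grid_before
```

With this toy grid, U+AC02 is the first missing syllable (slot 0), U+AC03 the second, U+AC05 the third; the conversion is a search plus subtraction rather than a 27k table, which is the trade-off discussed above.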
* Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-05 8:28 ` Roy 2013-08-05 15:43 ` Rich Felker @ 2013-08-05 19:12 ` Rich Felker 2013-08-06 6:14 ` Roy 1 sibling, 1 reply; 26+ messages in thread From: Rich Felker @ 2013-08-05 19:12 UTC (permalink / raw) To: musl On Mon, Aug 05, 2013 at 04:28:32PM +0800, Roy wrote: > Since I'm a Traditional Chinese and Japanese legacy encoding user, I > think I can say something here. > [...] > There is another Big5 extension called Big5-UAO, which is being used > in world's largest telnet-based BBS called "ptt.cc". > > It has two tables, one for Big5-UAO to Unicode, another one is > Unicode to Big5-UAO. > http://moztw.org/docs/big5/table/uao250-b2u.txt > http://moztw.org/docs/big5/table/uao250-u2b.txt > > Which extends DBCS lead byte to 0x81. OK, I've been trying to do some research on this and I turned up: http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0061.html http://lists.gnu.org/archive/html/bug-gnu-libiconv/2010-11/msg00007.html My impression (please correct me if I'm wrong) is that you can't use Big5-UAO as the system encoding on modern versions of Windows (just ancient ones where you install unmaintained third-party software that hacks the system charset tables) and that it's not supported in GNU libiconv. If this is the case, and especially if Big5-UAO's main use is on a telnet-based BBS where everybody is using special telnet clients that have their own Big5-UAO converters, I'd find it really hard to justify trying to support this. But I'm open to hearing arguments on why we should, if you believe it's important. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-05 19:12 ` Rich Felker @ 2013-08-06 6:14 ` Roy 2013-08-06 13:32 ` Rich Felker 0 siblings, 1 reply; 26+ messages in thread From: Roy @ 2013-08-06 6:14 UTC (permalink / raw) To: musl Tue, 06 Aug 2013 03:12:47 +0800, Rich Felker <dalias@aerifal.cx> wrote: > On Mon, Aug 05, 2013 at 04:28:32PM +0800, Roy wrote: >> Since I'm a Traditional Chinese and Japanese legacy encoding user, I >> think I can say something here. >> [...] >> There is another Big5 extension called Big5-UAO, which is being used >> in world's largest telnet-based BBS called "ptt.cc". >> >> It has two tables, one for Big5-UAO to Unicode, another one is >> Unicode to Big5-UAO. >> http://moztw.org/docs/big5/table/uao250-b2u.txt >> http://moztw.org/docs/big5/table/uao250-u2b.txt >> >> Which extends DBCS lead byte to 0x81. > > OK, I've been trying to do some research on this and I turned up: > > http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0061.html > http://lists.gnu.org/archive/html/bug-gnu-libiconv/2010-11/msg00007.html > > My impression (please correct me if I'm wrong) is that you can't use > Big5-UAO as the system encoding on modern versions of Windows (just > ancient ones where you install unmaintained third-party software that > hacks the system charset tables) It doesn't "hack" the nls file but replaces it with a UAO-enabled CP950 nls file. The executable (setup program) is generated with NSIS (Nullsoft Scriptable Install System). Since the nls file format hasn't changed from NT 3.1 in 1993 through NT 6.2 (i.e. Win 8.1 "Blue"), the UAO-enabled CP950 nls will continue to work in newer versions of Windows unless MS replaces the nls file format with something different. > and that it's not supported in GNU > libiconv.
If this is the case, and especially if Big5-UAO's main use > is on a telnet-based BBS where everybody is using special telnet > clients that have their own Big5-UAO converters, GNU libiconv doesn't even support IBM EBCDIC (both SBCS and stateful SBCS+DBCS)! So does it matter whether GNU libiconv supports a given encoding? (Yes, glibc iconv (or rather, its gconv modules) does support both IBM EBCDIC SBCS and stateful SBCS+DBCS encodings.) > I'd find it really > hard to justify trying to support this. But I'm open to hearing > arguments on why we should, if you believe it's important. I think it would be nice to have a build/link time option for those "unpopular" encodings. >> For static linking, can we have conditional linking like QT does? > > My feeling is that it's a tradeoff, and probably has more pros than > cons. Unlike QT, musl's iconv is extremely small. I would add "right now" here. When we add more encodings later, the iconv module will be bigger than now, and people will need to find a way to conditionally compile in the encodings they need (whether linking dynamically or statically). > Even with all the > above, the size of iconv.o will be under 130k, maybe closer to 110k. > If you actually use iconv in your program, this is a small price to > pay for having it fully functional. On the other hand, if linking it > is conditional, you have to consider who makes the decision, and when. > If it's at link time for each application, that's probably too much of > a musl-specific version. Since statically linking libc-iconv is a new area now (other libcs don't touch this topic much), I think we can create a standard for statically linking a specified encoding table at link time. (This is also a reason why libc should provide a unique identifier via a preprocessor define.) > If it's at build time for musl, then is it > your device vendor deciding for you what languages you need?
One of > the biggest headaches of uClibc-based systems is finding that the > system libc was built with important options you need turned off and > that you need to hack in a replacement to get something working... > > I think the cost of getting stuck with broken binaries where charsets > were omitted is sufficiently greater than the cost of adding a few > tens of kb to static binaries using iconv, that we should only > consider a build time option if embedded users are actively reporting > size problems. > > Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-06 6:14 ` Roy @ 2013-08-06 13:32 ` Rich Felker 2013-08-06 15:11 ` Roy 0 siblings, 1 reply; 26+ messages in thread From: Rich Felker @ 2013-08-06 13:32 UTC (permalink / raw) To: musl On Tue, Aug 06, 2013 at 02:14:33PM +0800, Roy wrote: > >My impression (please correct me if I'm wrong) is that you can't use > >Big5-UAO as the system encoding on modern versions of Windows (just > >ancient ones where you install unmaintained third-party software that > >hacks the system charset tables) > > It doesn't "hack" the nls file but replaces it with a UAO-enabled > CP950 nls file. > The executable (setup program) is generated with NSIS (Nullsoft > Scriptable Install System). > Since the nls file format hasn't changed from NT 3.1 in 1993 through > NT 6.2 (i.e. Win 8.1 "Blue"), the UAO-enabled CP950 nls will > continue to work in newer versions of Windows unless MS replaces the > nls file format with something different. OK, thanks for clarifying. I'd still consider it a ways into the "hack" domain if the OS vendor still is not supporting it directly, but it does make a difference that it still works "cleanly". I was under the impression that these sorts of things changed between Windows versions in ways that would preclude using old, unmaintained patches like this. I agree that just the fact that certain OS vendors do not support an encoding is not in itself a reason not to support it. > >and that it's not supported in GNU > >libiconv. If this is the case, and especially if Big5-UAO's main use > >is on a telnet-based BBS where everybody is using special telnet > >clients that have their own Big5-UAO converters, > > GNU libiconv doesn't even support IBM EBCDIC (both SBCS and stateful > SBCS+DBCS)! > > So does it matter whether GNU libiconv supports a given encoding?
> (Yes glibc iconv(or say, gconv modules) does support both IBM EBCDIC > SBCS and stateful SBCS+DBCS encodings) I was under the impression that GNU libiconv was in sync with glibc's iconv, but I have not checked this. I actually was more interested in glibc's, which is in widespread use. glibc's inclusion or exclusion of a feature is not in itself a reason to include or exclude it, but supporting something that glibc supports does have the added motivation that it will increase compatibility with what programs are expecting. > >I'd find it really > >hard to justify trying to support this. But I'm open to hearing > >arguments on why we should, if you believe it's important. > > I think it will be nice to have build/link time option for those > "unpopular" encodings. > > >>For static linking, can we have conditional linking like QT does? > > > >My feeling is that it's a tradeoff, and probably has more pros than > >cons. Unlike QT, musl's iconv is extremely small. > > I would add "right now" here. When we adds more encoding later, > iconv module will be bigger than now, and people will need to find a > way to conditionally compiling the encoding they need (for both > dynamically or statically) It's never been my intent to add more encodings later (aside from pure non-table-based variants of existing ones, like the ISO-2022 versions) once coverage is complete, at least not as built-in features. This can be discussed if you think there are reasons it needs to change, but up until now, the plan has been to support: - ISO-8859 based 8-bit encodings - Other 8-bit encodings with actual legacy usage (mainly Cyrillic) - JIS 0208 based encodings - KS X 1001 based encodings - GB 2312 and supersets - Big5 and supersets All of those except Big5 and supersets are now supported, so short of any change, my position is that right now we're discussing the "last" significant addition to musl's iconv. 
Some things that are definitely outside the scope of musl's iconv: - Anything whose characters are not present in Unicode - Anything PUA-based (really, same as above) - Newly invented encodings with no historical encoded data What's more borderline is where UAO falls: encodings that have neither governmental nor language-body-authority support nor any vendor support from other software vendors, but for which there is at least one major corpus of historical data and/or current usage for the encoding by users of the language(s) whose characters are encoded. However, based on the file at http://moztw.org/docs/big5/table/uao250-b2u.txt a number of the mappings UAO defines are into the private use area. This would generally preclude support (as this is a font-specific encoding, not a Unicode encoding) unless the affected characters have since been added to Unicode and could be remapped to the correct codepoints. Do you know the status on this? I'm also still unclear on whether this is a superset of HKSCS (it's definitely not directly, but maybe it is if the PUA mappings are corrected; I did not do any detailed checks but just noted the lack of mappings to the non-BMP codepoints HKSCS uses). > >Even with all the > >above, the size of iconv.o will be under 130k, maybe closer to 110k. > >If you actually use iconv in your program, this is a small price to > >pay for having it fully functional. On the other hand, if linking it > >is conditional, you have to consider who makes the decision, and when. > >If it's at link time for each application, that's probably too much of > >a musl-specific version. > > Since statically linking libc-iconv is new area now (other libc > doesn't touch this topic much), I think we can create standard for > statically linking specified encoding table in link time.
> (This is also a reason of "why libc should provide an unique > identifier with preprocessor define") I don't see how "creating a standard" for doing this would make the situation any better. Most software authors these days are at best tolerant of the existence of static linking, and more often hostile to it. They're not going to add specific build behavior for static linking, and even if they do, they're likely to get it wrong, in which case the user ends up stuck with binaries that can't process input in their language. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-06 13:32 ` Rich Felker @ 2013-08-06 15:11 ` Roy 2013-08-06 16:22 ` Rich Felker 0 siblings, 1 reply; 26+ messages in thread From: Roy @ 2013-08-06 15:11 UTC (permalink / raw) To: musl On Tue, 06 Aug 2013 21:32:05 +0800, Rich Felker <dalias@aerifal.cx> wrote: > On Tue, Aug 06, 2013 at 02:14:33PM +0800, Roy wrote: >> >My impression (please correct me if I'm wrong) is that you can't use >> >Big5-UAO as the system encoding on modern versions of Windows (just >> >ancient ones where you install unmaintained third-party software that >> >hacks the system charset tables) >> >> It doesn't "hack" the nls file but replaces with UAO-available CP950 >> nls file. >> The executable(setup program) is generated with NSIS(Nullsoft >> Scriptable Install System). >> Since the nls file format doesn't change since NT 3.1 in 1993 till >> now NT 6.2(i.e. Win 8.1 "Blue"), the UAO-available CP950 nls will >> continue to work in newer versions of windows unless MS throw away >> nls file format with something different. > > OK, thanks for clarifying. I'd still consider it a ways into the > "hack" domain if the OS vendor still is not supporting it directly, > but it does make a difference that it still works "cleanly". I was > under the impression that these sorts of things changes between > Windows versions in ways that would preclude using old, unmaintained > patches like this. I agree that just the fact that certain OS vendors > do not support an encoding is not in itself a reason not to support > it. > >> >and that it's not supported in GNU >> >libiconv. If this is the case, and especially if Big5-UAO's main use >> >is on a telnet-based BBS where everybody is using special telnet >> >clients that have their own Big5-UAO converters, >> >> GNU libiconv even not supports IBM EBCDIC(both SBCS and stateful >> SBCS+DBCS)! >> >> So does it matter if GNU libiconv is not support whatever encodings? 
>> (Yes glibc iconv(or say, gconv modules) does support both IBM EBCDIC >> SBCS and stateful SBCS+DBCS encodings) > > I was under the impression that GNU libiconv was in sync with glibc's > iconv, but I have not checked this. I actually was more interested in > glibc's, which is in widespread use. glibc's inclusion or exclusion of > a feature is not in itself a reason to include or exclude it, but > supporting something that glibc supports does have the added > motivation that it will increase compatibility with what programs are > expecting. > >> >I'd find it really >> >hard to justify trying to support this. But I'm open to hearing >> >arguments on why we should, if you believe it's important. >> >> I think it will be nice to have build/link time option for those >> "unpopular" encodings. >> >> >>For static linking, can we have conditional linking like QT does? >> > >> >My feeling is that it's a tradeoff, and probably has more pros than >> >cons. Unlike QT, musl's iconv is extremely small. >> >> I would add "right now" here. When we adds more encoding later, >> iconv module will be bigger than now, and people will need to find a >> way to conditionally compiling the encoding they need (for both >> dynamically or statically) > > It's never been my intent to add more encodings later (aside from pure > non-table-based variants of existing ones, like the ISO-2022 versions) > once coverage is complete, at least not as built-in features. This can > be discussed if you think there are reasons it needs to change, but up > until now, the plan has been to support: > > - ISO-8859 based 8-bit encodings > - Other 8-bit encodings with actual legacy usage (mainly Cyrillic) > - JIS 0208 based encodings > - KS X 1001 based encodings > - GB 2312 and supersets > - Big5 and supersets > > All of those except Big5 and supersets are now supported, so short of > any change, my position is that right now we're discussing the "last" > significant addition to musl's iconv. 
> > Some things that are definitely outside the scope of musl's iconv: > > - Anything whose characters are not present in Unicode > - Anything PUA-based (really, same as above) > - Newly invented encodings with no historical encoded data > > What's more borderline is where UAO falls: encodings that have neither > governmental nor language-body-authority support nor any vendor support > from other software vendors, but for which there is at least one major > corpus of historical data and/or current usage for the encoding by > users of the language(s) whose characters are encoded. > > However, based on the file at > > http://moztw.org/docs/big5/table/uao250-b2u.txt > > a number of the mappings UAO defines are into the private use area. > This would generally preclude support (as this is a font-specific > encoding, not a Unicode encoding) unless the affected characters have > since been added to Unicode and could be remapped to the correct > codepoints. Do you know the status on this? Those are in the Big5-2003 compatibility code range. Big5-2003 is in a CNS 11643 appendix, but it is rarely used since no OS/application supports it. So skipping the PUA mappings is fine. > > I'm also still unclear on whether this is a superset of HKSCS (it's > definitely not directly, but maybe it is if the PUA mappings are > corrected; I did not do any detailed checks but just noted the lack of > mappings to the non-BMP codepoints HKSCS uses). No, it isn't. There are some code conflicts between HKSCS (2001/2004) and UAO. > >> >Even with all the >> >above, the size of iconv.o will be under 130k, maybe closer to 110k. >> >If you actually use iconv in your program, this is a small price to >> >pay for having it fully functional. On the other hand, if linking it >> >is conditional, you have to consider who makes the decision, and when. >> >If it's at link time for each application, that's probably too much of >> >a musl-specific version. 
>> Since statically linking libc iconv is a new area (other libcs >> don't touch this topic much), I think we can create a standard for >> statically linking specified encoding tables at link time. >> (This is also a reason why libc should provide a unique >> identifier via a preprocessor define.) > > I don't see how "creating a standard" for doing this would make the > situation any better. Most software authors these days are at best > tolerant of the existence of static linking, and more often hostile to > it. They're not going to add specific build behavior for static > linking, and even if they do, they're likely to get it wrong, in which > case the user ends up stuck with binaries that can't process input in > their language. > > Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-06 15:11 ` Roy @ 2013-08-06 16:22 ` Rich Felker 2013-08-07 0:54 ` Roy 0 siblings, 1 reply; 26+ messages in thread From: Rich Felker @ 2013-08-06 16:22 UTC (permalink / raw) To: musl On Tue, Aug 06, 2013 at 11:11:23PM +0800, Roy wrote: > >However, based on the file at > > > >http://moztw.org/docs/big5/table/uao250-b2u.txt > > > >a number of the mappings UAO defines are into the private use area. > >This would generally preclude support (as this is a font-specific > >encoding, not a Unicode encoding) unless the affected characters have > >since been added to Unicode and could be remapped to the correct > >codepoints. Do you know the status on this? > > Those are in the Big5-2003 compatibility code range. Big5-2003 is in > a CNS 11643 appendix, but it is rarely used since no > OS/application supports it. > So skipping the PUA mappings is fine. OK, a few more questions... 1. What, if anything, is the accepted charset name for Big5-UAO, i.e. how would it appear in MIME headers, etc.? 2. Can you give me an idea of the relationship between the Big5 variants/extensions/supersets? I'm aware of Windows CP950, HKSCS, and now UAO. Is CP950 a common subset of them all, or is there a smaller base subset "plain Big5" that's the only shared part? What is ETEN and how does it fit in? 3. How should different MIME charset names be handled? In particular, what does plain "Big5" refer to? Should it be interpreted as CP950? 4. Is there anywhere to get clean semi-authoritative sources for the definitions of these charsets in plain text form? For HKSCS I found a government PDF file but it's useless because you can't extract the data in any meaningful way. Unicode has the CP950 file and "BIG5" file, but the latter refers to Unicode 1.1 in the comments and I've heard claims that it's completely wrong on many issues. 
Unihan.txt is also fairly useless because it only defines the mappings for ideographic characters, not the rest of the mappings in legacy CJK encodings. Short of anything better I may just have to use glibc output as a reference... > >I'm also still unclear on whether this is a superset of HKSCS (it's > >definitely not directly, but maybe it is if the PUA mappings are > >corrected; I did not do any detailed checks but just noted the lack of > >mappings to the non-BMP codepoints HKSCS uses). > > No, it isn't. There are some code conflicts between HKSCS (2001/2004) and UAO. Some conflict or heavy conflict? From an implementation standpoint, I want to know if this is something where they could use a common table plus "if (type==BIG5UAO) { /* fixups here */ ... }" or if they need completely separate tables. Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Re: Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-06 16:22 ` Rich Felker @ 2013-08-07 0:54 ` Roy 2013-08-07 7:20 ` Roy 0 siblings, 1 reply; 26+ messages in thread From: Roy @ 2013-08-07 0:54 UTC (permalink / raw) To: musl On Wed, 07 Aug 2013 00:22:15 +0800, Rich Felker <dalias@aerifal.cx> wrote: > On Tue, Aug 06, 2013 at 11:11:23PM +0800, Roy wrote: >> >However, based on the file at >> > >> >http://moztw.org/docs/big5/table/uao250-b2u.txt >> > >> >a number of the mappings UAO defines are into the private use area. >> >This would generally preclude support (as this is a font-specific >> >encoding, not a Unicode encoding) unless the affected characters have >> >since been added to Unicode and could be remapped to the correct >> >codepoints. Do you know the status on this? >> >> Those are in the Big5-2003 compatibility code range. Big5-2003 is in >> a CNS 11643 appendix, but it is rarely used since no >> OS/application supports it. >> So skipping the PUA mappings is fine. > > OK, a few more questions... > > 1. What, if anything, is the accepted charset name for Big5-UAO, i.e. > how would it appear in MIME headers, etc.? There isn't one; actually, all Big5 variants use "big5". > > 2. Can you give me an idea of the relationship between the Big5 > variants/extensions/supersets? I'm aware of Windows CP950, HKSCS, and > now UAO. Is CP950 a common subset of them all, or is there a smaller > base subset "plain Big5" that's the only shared part? What is ETEN and > how does it fit in? MS CP950 can be considered a common subset of HKSCS/UAO/ETEN, etc. Big5-ETEN mostly looks like CP950 but adds the Japanese katakana/hiragana area, etc. > > 3. How should different MIME charset names be handled? In particular, > what does plain "Big5" refer to? Should it be interpreted as CP950? Since they all use the same MIME name, it depends on the system codepage. Some Hong Kong news websites still use Big5-HKSCS. 
For people using Internet Explorer with HKSCS installed, the big5 MIME name will map to Big5-HKSCS (that is, the single CP950 entry is mapped to CP951.nls, which is HKSCS). Firefox users have to choose Big5-HKSCS by hand, or via an extension that checks the domain name. > > 4. Is there anywhere to get clean semi-authoritative sources for the > definitions of these charsets in plain text form? For HKSCS I found a > government PDF file but it's useless because you can't extract the > data in any meaningful way. Unicode has the CP950 file and "BIG5" > file, but the latter refers to Unicode 1.1 in the comments and I've > heard claims that it's completely wrong on many issues. Unihan.txt is > also fairly useless because it only defines the mappings for > ideographic characters, not the rest of the mappings in legacy CJK > encodings. Short of anything better I may just have to use glibc > output as a reference... There is documentation created by the Mozilla Taiwan community: http://moztw.org/docs/big5/ Google Translate: http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=zh-TW&ie=UTF-8&u=http%3A%2F%2Fmoztw.org%2Fdocs%2Fbig5%2F > >> >I'm also still unclear on whether this is a superset of HKSCS (it's >> >definitely not directly, but maybe it is if the PUA mappings are >> >corrected; I did not do any detailed checks but just noted the lack of >> >mappings to the non-BMP codepoints HKSCS uses). >> >> No, it isn't. There are some code conflicts between HKSCS (2001/2004) and >> UAO. > > Some conflict or heavy conflict? From an implementation standpoint, I > want to know if this is something where they could use a common table > plus "if (type==BIG5UAO) { /* fixups here */ ... }" or if they need > completely separate tables. 
Big5-HKSCS 2004 map for reference: http://moztw.org/docs/big5/table/hkscs2004.txt Use sed and awk to create a b2u.txt for comparison: $ sed -e '/^==/d' -e '1,2d' hkscs2004.txt | awk 'BEGIN{print "# big5 unicode"}{print "0x" $1 " 0x" $4}' > hkscs2004-b2u.txt The result: http://roy.dnsd.me/hkscs2004-b2u.txt And finally the diff: http://roy.dnsd.me/uao250-hkscs2004.diff The diff is huge, so a separate table is needed. > > Rich ^ permalink raw reply [flat|nested] 26+ messages in thread
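[Editor's note: the extraction pipeline in the message above can be tried end to end on fabricated input. The sample below assumes, as the sed/awk command implies, that hkscs2004.txt has two header lines, `==` rule lines, and data rows with the Big5 code in column 1 and the Unicode code point in column 4; the file contents here are toy data, not the real table.]

```shell
# Toy reproduction of the b2u extraction step. The input layout below
# is an assumption inferred from the sed/awk command in the message:
# two header lines, '==' rules, then whitespace-separated rows with
# Big5 in field 1 and Unicode in field 4.
cat > hkscs2004-sample.txt <<'EOF'
Big5-HKSCS:2004 mapping table
big5  iso  cns  unicode
==============================
8840  .    .    00A8
8841  .    .    00B8
EOF

# Drop the headers and rules, keep fields 1 and 4, prefix with 0x.
sed -e '/^==/d' -e '1,2d' hkscs2004-sample.txt |
awk 'BEGIN{print "# big5 unicode"}{print "0x" $1 " 0x" $4}' \
    > sample-b2u.txt

cat sample-b2u.txt
# prints:
# # big5 unicode
# 0x8840 0x00A8
# 0x8841 0x00B8
```

Any POSIX shell with sed and awk should produce the same three-line table.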
* Re: Re: Re: Re: iconv Korean and Traditional Chinese research so far 2013-08-07 0:54 ` Roy @ 2013-08-07 7:20 ` Roy 0 siblings, 0 replies; 26+ messages in thread From: Roy @ 2013-08-07 7:20 UTC (permalink / raw) To: musl On Wed, 07 Aug 2013 08:54:35 +0800, Roy <roytam@gmail.com> wrote: [snip] > > Big5-HKSCS 2004 map for reference: > http://moztw.org/docs/big5/table/hkscs2004.txt > Use sed and awk to create a b2u.txt for comparison: > $ sed -e '/^==/d' -e '1,2d' hkscs2004.txt | awk 'BEGIN{print "# big5 > unicode"}{print "0x" $1 " 0x" $4}' > hkscs2004-b2u.txt > The result: > http://roy.dnsd.me/hkscs2004-b2u.txt > > And finally the diff: > http://roy.dnsd.me/uao250-hkscs2004.diff > > The diff is huge, so a separate table is needed. I forgot that the HKSCS table is missing the original CP950 entries. $ cat cp950-b2u.txt hkscs2004-b2u.txt | sed -e '1d' | sort > hkscs2004-big5-b2u.txt I also wrote a small utility in PHP to compare two tables by key (the first column): http://roy.dnsd.me/tbldiff.phps $ php tbldiff.php uao250-b2u.txt hkscs2004-big5-b2u.txt > uao250-vs-hkscs2004.txt http://roy.dnsd.me/uao250-vs-hkscs2004.txt $ sed -e '/==/d' uao250-vs-hkscs2004.txt > uao250-hkscs2004-diff.txt http://roy.dnsd.me/uao250-hkscs2004-diff.txt So 5965 mappings are different, including 1379 that do not exist in HKSCS 2004. But since there is mixed usage of HKSCS 2001/2004 in both local files and Internet pages, the HKSCS situation is even worse. BTW, there is another NLS hack that patches MS CP932 to support JIS X 0213:2004: http://www.eonet.ne.jp/~kotobukispace/ddt/jisx0213/jisx0213.html ^ permalink raw reply [flat|nested] 26+ messages in thread
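[Editor's note: for readers without the linked PHP source, a hypothetical awk equivalent of the key-based comparison that tbldiff.php performs might look like the sketch below. The file names, toy mappings, and output format are illustrative, not what the real utility prints.]

```shell
# Hypothetical key-based table diff in the spirit of tbldiff.php:
# join two "0xBIG5 0xUCS" tables on the first column, report entries
# whose mappings differ, and keys present in only one file.
cat > a-b2u.txt <<'EOF'
0xA140 0x3000
0xA141 0xFE50
0xA142 0x3001
EOF
cat > b-b2u.txt <<'EOF'
0xA140 0x3000
0xA141 0xFF0C
0xA143 0x3002
EOF

# First pass (NR==FNR) loads table A; second pass walks table B.
awk 'NR==FNR { a[$1]=$2; next }
     $1 in a { if (a[$1] != $2) print $1, a[$1], "->", $2; delete a[$1]; next }
     { print $1, "only in second" }
     END { for (k in a) print k, "only in first" }' \
    a-b2u.txt b-b2u.txt > tbl-diff.txt

cat tbl-diff.txt
# prints (END-loop ordering may vary):
# 0xA141 0xFE50 -> 0xFF0C
# 0xA143 only in second
# 0xA142 only in first
```

The `NR==FNR` idiom is what makes the first file act as the lookup table; identical mappings are deleted as they match, so only conflicts and one-sided keys remain.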
end of thread, other threads:[~2013-08-07 7:20 UTC | newest] Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-08-04 16:51 iconv Korean and Traditional Chinese research so far Rich Felker 2013-08-04 22:39 ` Harald Becker 2013-08-05 0:44 ` Szabolcs Nagy 2013-08-05 1:24 ` Harald Becker 2013-08-05 3:13 ` Szabolcs Nagy 2013-08-05 7:03 ` Harald Becker 2013-08-05 12:54 ` Rich Felker 2013-08-05 0:49 ` Rich Felker 2013-08-05 1:53 ` Harald Becker 2013-08-05 3:39 ` Rich Felker 2013-08-05 7:53 ` Harald Becker 2013-08-05 8:24 ` Justin Cormack 2013-08-05 14:43 ` Rich Felker 2013-08-05 14:35 ` Rich Felker 2013-08-05 0:46 ` Harald Becker 2013-08-05 5:00 ` Rich Felker 2013-08-05 8:28 ` Roy 2013-08-05 15:43 ` Rich Felker 2013-08-05 17:31 ` Rich Felker 2013-08-05 19:12 ` Rich Felker 2013-08-06 6:14 ` Roy 2013-08-06 13:32 ` Rich Felker 2013-08-06 15:11 ` Roy 2013-08-06 16:22 ` Rich Felker 2013-08-07 0:54 ` Roy 2013-08-07 7:20 ` Roy
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/