mailing list of musl libc
* Re: iconv Korean and Traditional Chinese research so far
       [not found] <20130804232816.dc30d64f61e5ec441c34ffd4f788e58e.313eb9eea8.wbe@email22.secureserver.net>
@ 2013-08-05 12:46 ` Rich Felker
  0 siblings, 0 replies; 18+ messages in thread
From: Rich Felker @ 2013-08-05 12:46 UTC (permalink / raw)
  To: musl

On Sun, Aug 04, 2013 at 11:28:16PM -0700, writeonce@midipix.org wrote:
>      -------- Original Message --------
>      Subject: Re: [musl] iconv Korean and Traditional Chinese research so far
>      From: Rich Felker <dalias@aerifal.cx>
>      Date: Sun, August 04, 2013 10:00 pm
>      To: musl@lists.openwall.com
>      Being that HKSCS is a standard, registered MIME charset and the cost
>      is only 10k, and that it seems necessary for real world usage in Hong
>      Kong, I think it's pretty obvious that we should support it. So I
>      think the question we're left with is whether the CP949 (MS encoding)
>      extension for Korean is important to support. The cost is roughly 37k.
> 
>    In case that helps: Korean typesetting packages for
>    TeX/LaTeX/XeLaTeX/LuaLaTeX do provide support for CP949.  This includes
>    the most recent versions; see, for instance:
>    http://osl.ugr.es/CTAN/macros/luatex/generic/luatexko/luatexko-uhc2utf8.lua
>    There are also several packages that support UHC, which allegedly overlaps
>    with CP949:
>    http://www.ctan.org/topic/korean
>    Best regards,
>    zg

Thanks. I've also seen a fair number of bug reports asking for
programs that lack it to add support, so it seems there is at least
some demand. I'm going to go ahead and look into whether there's a way
we could do slow but algorithmic conversion (e.g. are all the
characters outside the standard range allocated sequentially, based on
the set missing from the standard range?), since that would make the
question of whether or not it's needed fairly irrelevant.
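
The idea can be sketched with a toy example (hypothetical data, not
the real CP949 layout): if the extension characters are assigned
sequentially to exactly the codepoints absent from the standard range,
then the idx-th extension character can be found by counting gaps
instead of storing a ~37k mapping table:

```c
#include <assert.h>
#include <stddef.h>

/* Toy illustration of "slow but algorithmic" conversion: std[] is a
 * sorted list of codepoints already covered by the standard range;
 * the idx-th extension character is the idx-th codepoint >= base that
 * is missing from std[]. All names and data here are illustrative. */
static unsigned ext_char(const unsigned *std, size_t nstd,
                         unsigned base, unsigned idx)
{
    size_t s = 0;
    for (unsigned c = base; ; c++) {
        while (s < nstd && std[s] < c) s++;
        if (s < nstd && std[s] == c) continue;   /* already covered */
        if (idx-- == 0) return c;                /* idx-th gap found */
    }
}
```

This trades table space for scan time, which is exactly the "slow but
algorithmic" tradeoff in question; whether CP949 actually allocates
its extension syllables this way is the open research item.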

Rich


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  8:24           ` Justin Cormack
@ 2013-08-05 14:43             ` Rich Felker
  0 siblings, 0 replies; 18+ messages in thread
From: Rich Felker @ 2013-08-05 14:43 UTC (permalink / raw)
  To: musl

On Mon, Aug 05, 2013 at 09:24:37AM +0100, Justin Cormack wrote:
> They are not going to be "fixed" just don't build them. It is not hard with
> Musl. Just add this into your build script.

Indeed. My intent is for it to be fully-functional-as-shipped. If
somebody needs to cripple certain interfaces to meet extreme size
requirements, that's an ok local modification, and it might even be
acceptable as a configure option if enough people legitimately request
it.

> One of the nice features of Musl is that it appeals to a broader audience
> than just "embedded" so it is always going to have stuff you can cut out if
> you want absolute minimalism but this means it will get wider usage.

Cutting out math/*, complex/*, and most of crypt/* would save at least
as much space as iconv, and there are plenty of places these aren't
needed either. It's not for me to decide which options you can omit.
Thankfully, due to musl's correct handling of static linking, you
usually don't have to think about it either. You just static link and
get only what you need.

> Adding external files has many disadvantages to other people. If you don't
> want these conversions external files do not help you.

External files also do not make things work "by default". They only
work if musl has been installed system-wide according to our
directions (which not everybody will follow) or if the user has done
the research to figure out how to work around it not being installed
system-wide.

> Making software for more than one person involves compromises so please
> calm down a bit. Use your own embedded build with the parts you don't need
> omitted.

Exactly. Where musl excels here is in not _forcing_ you to use iconv.
I take great care not to force linking of components you might not
want to see in your output binary size. Even for TLS, which
unfortunately was misdesigned in such a way that the linker can't tell
whether TLS is used when deciding whether to link the TLS init code, I
went to great lengths both to minimize the size of __init_tls.o and to
make it easy, as a local customization, to omit this module. But as an
analogy, I would not have even considered asking musl users who need
TLS to add special CFLAGS, libraries, etc. when building programs.
That's an unreasonable burden, and it's broken because it does not
"work by default".

Rich



* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  7:53         ` Harald Becker
  2013-08-05  8:24           ` Justin Cormack
@ 2013-08-05 14:35           ` Rich Felker
  1 sibling, 0 replies; 18+ messages in thread
From: Rich Felker @ 2013-08-05 14:35 UTC (permalink / raw)
  To: musl

On Mon, Aug 05, 2013 at 09:53:32AM +0200, Harald Becker wrote:
> Hi Rich !
> 
> > iconv is not something that needs to be extensible. There is a
> > finite set of legacy encodings that's relevant to the world,
> > and their relevance is going to go down and down with time, not
> > up.
> 
> Oh! So you consider Japanese, Chinese, Korean, etc. languages
> relevant for programs sitting on my machines? How can you decide

I don't decide what's relevant for you. Rather, I don't have the
authority to declare it irrelevant-by-default. This is true even for
things like crypt algorithms (does anybody really want to use md5??)
but especially for anything that would preclude somebody from being
able to receive data in their native language. Simple multilingual
support via UTF-8 with conversion from legacy data has been near top
priority, if not top, since the conception of musl.

If history has shown us anything, it's that universal support for all
languages must be default and turning off some support to save space
(which is rarely if ever actually needed) needs to be a conscious
decision. I'm no Apple fan by any means, but just look at the
situation on iOS: you can turn on a new iPhone or iPad and read data
in any language (including having the relevant fonts!) and even add a
keyboard and type in almost any language, without having to buy a
special localized version or install add-ons. This is very different
from the situation on Android right now.

musl's intended applicability is broad. From industrial control to
settop boxes, in-car entertainment, initramfs images for desktop
machines, phones, tablets, plug computers that run your private home
or office webmail server, full desktops, VE LAMP stacks, hosts for
VEs, etc. Some of these usages have a real need for human-language
text; others don't. But if we have the power to make it such that, if
someone uses musl to implement a plug computer for webmail, it
naturally supports all languages unless the maker of the device goes
and actively rips that support out, then we have a responsibility to
do so. Or, said differently, it's OUR FAULT for making
broken-by-default software if language support is missing unless you
go to the effort of learning musl-specific ways to enable it.

> this? Why being so ignorant and trying to write an standard
> conform library and then pick out a list of char sets of your
> choice which may be possible on iconv, neglecting wishes and
> need of any musl user.

If I were to just accept your demands, it would essentially mean:

(1) discarding the opinions of everybody else who discussed this issue
in the past and decided that static linking should mean real static
binaries that work the same without needing extra files in the
filesystem.

(2) discarding the informed decisions I made based on said
discussions.

> .... or in other words, if you really be this ignorant and
> insist on including those charsets fixed in musl, musl is never
> more for me :( ... I don't need to bring in any part of mine into
> musl, but I don't consider a lib usable for my needs, which
> include several char set files in statical build and neglects to
> load seldom used charset definitions from extern in any way.

Name the extra "seldom used charset definitions" you're interested in.
They're probably already supported. We are not discussing adding some
new giant subsystem to musl. We are discussing adding the last two
missing major legacy charsets to a framework that has existed for a
long time.

> > > > Do I want to give users who have large volumes of legacy
> > > > text in their languages stored in these encodings the same
> > > > respect and dignity as users of other legacy encodings we
> > > > already support? Yes.
> > > 
> > > Of course. I won't dictate others which conversions they want
> > > to use. I only hat to have plenty of conversion tables on my
> > > system when I really know I never use such kind of
> > > conversions.
> > 
> > And your table for just Chinese is as large as all our tables
> > combined...
> 
> How can you tell this. I don't think so.

You're welcome to implement it and see. Thanks to the way static
linking works, if you add -lyouriconv when static linking, the iconv
in musl will be completely omitted from the binary and yours will be
used instead. Of course the iconv in musl will be completely omitted
anyway except in the small number of programs that actually use iconv.
This is not glibc where stdio and locale depend on iconv. iconv is
purely iconv.

> Such conversion codes
> may be very compact. Size is mainly required for translation
> tables, that is when code points of the char sets does not match
> Unicode character order, but you always need the space for those
> translations. The rest won't be much.

That's where essentially all of the size goes. The VAST majority of the table size is for 4
major character encoding families, those based on:

- JIS 0208
- GB 18030
- KS X 1001
- Big5

As for legacy 8-bit encodings, musl's approach to them is also more
efficient than you could easily achieve with a state machine. Because
fewer than 1024 distinct codepoints ever appear across the 8-bit
encodings, the mappings are stored as packed arrays of 10-bit entries
indexing into the legacy_chars table. This reduces the marginal cost
of each individual 8-bit encoding by 25% (versus 16-bit entries). The
ASCII range, and any span upward from it that maps directly to Unicode
codepoints, is also elided from the table (which reduces ISO-8859-* by
another 62.5%).
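
As a rough sketch of that kind of packing (one plausible layout, not
necessarily musl's exact one): since 10*i mod 8 is always 0, 2, 4, or
6, each 10-bit entry spans at most two bytes, so accessors stay tiny:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative 10-bit-per-entry packed array (layout is an assumption,
 * not lifted from musl's source): entry i lives in bits [10*i, 10*i+10)
 * of the byte array, little-endian within each two-byte window. */
static unsigned get10(const unsigned char *tab, size_t i)
{
    size_t bit = i * 10;
    unsigned v = tab[bit/8] | (tab[bit/8 + 1] << 8);
    return (v >> (bit % 8)) & 0x3ff;
}

static void put10(unsigned char *tab, size_t i, unsigned val)
{
    size_t bit = i * 10;
    unsigned v = tab[bit/8] | (tab[bit/8 + 1] << 8);
    v &= ~(0x3ffu << (bit % 8));          /* clear this entry's bits */
    v |= (val & 0x3ff) << (bit % 8);      /* write the new value     */
    tab[bit/8] = v & 0xff;
    tab[bit/8 + 1] = v >> 8;
}
```

Four entries fit in every five bytes, which is where the 25% saving
over 16-bit entries comes from.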

In short, what we have is about the smallest possible representation
you can get without applying LZMA or something (and thereby needing
all the code to decompress and dirty pages to store the decompressed
version). It's hard to beat.

By the way, if you really want to save the space they take, you could
just delete this email thread from your mail folder. It's larger than
musl's iconv already. :-)

> > I agree you can make iconv smaller than musl's in the case
> > where _no_ legacy DBCS are installed. But if you have just one,
> > you'll be just as large or larger than musl with them all.
> 
> .... musl with them all? I don't consider them smaller than an
> optimized byte code interpreter ... not when you are going to
> include DBCS char sets fixed into musl. At least if you do all
> the required translations.

I may have been exaggerating a little bit, but I doubt you can get
your bytecode GB18030 support smaller than about 110k once you count
the bytecode and the interpreter binary. I'm even more doubtful that
you can get it smaller than the current 71k in musl.

> > compare the size of musl's tables to glibc's converters. I've
> > worked hard to make them as small as reasonably possible
> > without doing hideous hacks like decompression into an
> > in-memory buffer, which would actually increase bloat.
> 
> Are you now going to build a lib for startup purpose and embedded
> systems only or are you trying to write a general purpose
> library?

General-purpose. Have you not read the website?

    Originally in the 1990s, Linux-based systems used a fork of the
    GNU C library (glibc) version 1, which existed in various versions
    (libc4, libc5). Later, distributions adopted the more mature
    version 2 of glibc, and denoted it libc6. Since then, other
    specialized C library implementations such as uClibc and dietlibc
    have emerged as well.

    musl is a new general-purpose implementation of the C library. It
    is lightweight, fast, simple, free, and aims to be correct in the
    sense of standards-conformance and safety.

If you're using it for startup purposes or embedded systems that don't
communicate with humans in human language, you won't be running
applications that call iconv() and thus it's irrelevant.

> On one hand you say "use dietlibc" if you need small statical
> programs and on the other hand you want to include many charset
> definitions into a statical build to avoid dynamic loading of
> tables, required only on embedded systems.

Where did I say "use dietlibc"? If I did (I don't really remember) it
was not a serious recommendation but a sarcastic remark to make a
point that musl is not about being "smallest-at-all-costs" (and
thereby broken) like dietlibc is.

> > have been over in 1992, when Pike and Thompson made them
> > obsolete, but it's really over now.
> 
> So why are you adding Japanese, Chinese and Korean charsets to an
> iconv conversion in musl? Why not just using UTF-8? Whenever you
> use iconv you want the flexibility to do all required charset
> conversions. Which means you need to statically link in many
> charset definitions or you need to dynamically load what is
> required.

The time of creating charsets is over. That does not magically make
the data created in those charsets in the past go away or convert
itself to UTF-8. It doesn't even magically stop people from making new
data in those charsets. All it means is that governments, vendors,
etc. have stopped the madness of making new charsets.

> > Then dynamic link it. If you want an extensible binary, you use
> > dynamic linking.
> 
> Dynamic linking of mail client, ok and where go the charset
> definition files? Are they all packed into your libc.so? That is
> a very big file? Why do I need to have Asian language definition
> on my disk, when I do not want?

Because any other solution would be larger, would defeat the purpose
of static linking, and would contribute to the problem of poor
multilingual support. Why are you upset about these tables and not
other tables like crypto sboxes, wcwidth, character classes, bits of
2/pi and pi/2, etc.? By the way, math/*.o are also fairly large, on
the same order of magnitude as iconv; would you also suggest we move
it all out to bytecode loaded at runtime even in static binaries?

> It is your decision, but please state clear what purpose you are
> building musl. Here it looks you are mixing things and steping in
> a direction I will never like.

This has all been documented all along. I'm sorry you don't understand
the goals of the project. Perhaps your misunderstanding is what
"general purpose" means. It does not mean we omit anything that could
offend anyone by wasting a few bytes on their hard drive. It means we
don't cut corners that break important usage cases. Having a complete
iconv linked whenever you link a program using iconv() does not break
your usage case unless you have less than 100k of disk/ssd/rom storage
to spare, and in that case, you probably shouldn't be using iconv. If
anyone ever does have a practical difficulty because of this, rather
than theoretical complaints based on anglocentrism, eurocentrism,
and/or xenophobia, I am not entirely opposed to making a build option
to omit iconv tables, but it has to be well-motivated.

Rich



* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  7:03         ` Harald Becker
@ 2013-08-05 12:54           ` Rich Felker
  0 siblings, 0 replies; 18+ messages in thread
From: Rich Felker @ 2013-08-05 12:54 UTC (permalink / raw)
  To: musl; +Cc: nsz

On Mon, Aug 05, 2013 at 09:03:43AM +0200, Harald Becker wrote:
> The only code that get a bit more, is the file system search.
> This depends if we only try single location or walk through a
> search path list. But this is the cost of flexibility to
> dynamically load character set conversions (which I would really
> prefer for seldom used char sets).

The only "seldom used char sets" are either extremely small (8bit
codepages) or simply encoding variants of an existing CJK DBCS (in
which case it's just a matter of code, not large data tables, to
support them).

> .... and for application writer it is only more, if he likes to
> add some charset tables into his program, which are not in
> statical libc.

This is only helpful if the application writer is designing around
musl. This is a practice we explicitly discourage.

> The problem is, all tables in libc need to be linked to your
> program, if you include iconv. So each added charset conversion
> increases size of your program ... and I definitly won't include
> Japanese, Chinese or Korean charsets in my program. No that I
> ignore those peoples need, I just wont need it, so I don't like
> to add those conversions to programs sitting on my disk.

How many programs do you intend to use iconv in that _don't_ need to
support arbitrary encodings including ones you might not be using
yourself? Even if you don't read Korean, if a Korean user sends you an
email containing non-ASCII punctuation, Greek letters like epsilon,
etc. there's a fair chance their MUA will choose to encode with a
legacy Korean encoding rather than UTF-8, and then you need the
conversion.

It would be nice if everybody encoded everything in UTF-8 so the
recipient was not responsible for supporting a wide range of legacy
encodings, but that's not the reality today.

> If the definition of the iconv virtual state machine is modified,
> you need to do extra care on update (delete old charset files,
> install new lib, install new charset files, restart system) ...
> but this is only required on a major update. As soon as the

Even if there were really good reasons for the design you're
proposing, such a violation of the stability and atomic upgrade policy
would require a strong overriding justification. We don't have that
here.

Rich



* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 16:51 Rich Felker
                   ` (2 preceding siblings ...)
  2013-08-05  5:00 ` Rich Felker
@ 2013-08-05  8:28 ` Roy
  3 siblings, 0 replies; 18+ messages in thread
From: Roy @ 2013-08-05  8:28 UTC (permalink / raw)
  To: musl

Since I'm a user of Traditional Chinese and Japanese legacy encodings,
I think I can say something here.

Mon, 05 Aug 2013 00:51:52 +0800, Rich Felker <dalias@aerifal.cx> wrote:

> OK, so here's what I've found so far. Both legacy Korean and legacy
> Traditional Chinese encodings have essentially a single base character
> set:
>

>
> Traditional Chinese:
> Big5 (CP950)
> 89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
> All characters in BMP
> 27946 bytes table space
>
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
>
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
>

There is another Big5 extension called Big5-UAO, which is used by the
world's largest telnet-based BBS, "ptt.cc".

It has two tables: one mapping Big5-UAO to Unicode, and another
mapping Unicode to Big5-UAO.
http://moztw.org/docs/big5/table/uao250-b2u.txt
http://moztw.org/docs/big5/table/uao250-u2b.txt

It extends the DBCS lead byte range down to 0x81.

> The big remaining questions are:
>
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).

For Big5-UAO: it contains Japanese and Simplified Chinese characters
which do not exist in the original MS-CP950 implementation.

>
> 2. Are there patterns to exploit? For Korean, ALL of the Hangul
> characters are actually combinations of several base letters. Unicode
> encodes them all sequentially in a pattern where the conversion to
> their constituent letters is purely algorithmic, but there seems to
> be no clean pattern in the legacy encodings, as the encodings started
> out just encoding the "important" ones, then adding less important
> combinations in separate ranges.

In EUC-KR (MS-CP949), there are Hanja characters (i.e. the Kanji
characters of Japanese) and Japanese Katakana/Hiragana besides the
Hangul characters.
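
For reference, the algorithmic Unicode layout mentioned above is
simple: the 11,172 precomposed syllables occupy U+AC00..U+D7A3,
ordered as (lead, vowel, trailing). A minimal decomposition sketch:

```c
#include <assert.h>

/* Unicode precomposed Hangul syllables are laid out algorithmically:
 * S = 0xAC00 + (lead*21 + vowel)*28 + trailing, with 19 lead
 * consonants, 21 vowels, and 28 trailing values (0 = no trailing).
 * Returns 1 and fills l/v/t on success, 0 if c is not a syllable. */
static int hangul_decompose(unsigned c, unsigned *l, unsigned *v, unsigned *t)
{
    if (c < 0xAC00 || c > 0xD7A3) return 0;
    c -= 0xAC00;
    *t = c % 28;          /* trailing consonant index */
    *v = (c / 28) % 21;   /* vowel index              */
    *l = c / (28 * 21);   /* lead consonant index     */
    return 1;
}
```

The legacy encodings lack this regularity, which is exactly why they
need tables while Unicode itself does not.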

>
> Worst-case, adding Korean and Traditional Chinese tables will roughly
> double the size of iconv.o to around 150k. This will noticeably enlarge
> libc.so, but will make no difference to static-linked programs except
> those using iconv. I'm hoping we can make these additions less
> expensive, but I don't see a good way yet.

For static linking, can we have conditional linking like Qt does?
In Qt static linking, Q_IMPORT_PLUGIN is used to include the CJK codec
tables.

#ifndef QT_SHARED
     #include <QtPlugin>

     Q_IMPORT_PLUGIN(qcncodecs)
     Q_IMPORT_PLUGIN(qjpcodecs)
     Q_IMPORT_PLUGIN(qkrcodecs)
     Q_IMPORT_PLUGIN(qtwcodecs)
#endif


>
> At some point, especially if the cost is not reduced, I will probably
> add build-time options to exclude a configurable subset of the
> supported character encodings. This would not be extremely
> fine-grained, and the choices to exclude would probably be just:
> Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
> 8-bit might also be an option but these are so small I can't think of
> cases where it would be beneficial to omit them (5k for the tables on
> top of the 2k of actual code in iconv). Perhaps if there are cases
> where iconv is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices, dropping the
> 8-bit tables and all of the support code could be useful; the
> resulting iconv would be around 1k, I think.
>
> Rich
>

HTH,
Roy




* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  7:53         ` Harald Becker
@ 2013-08-05  8:24           ` Justin Cormack
  2013-08-05 14:43             ` Rich Felker
  2013-08-05 14:35           ` Rich Felker
  1 sibling, 1 reply; 18+ messages in thread
From: Justin Cormack @ 2013-08-05  8:24 UTC (permalink / raw)
  To: musl


On 5 Aug 2013 08:53, "Harald Becker" <ralda@gmx.de> wrote:
>
> Hi Rich !
>
> > iconv is not something that needs to be extensible. There is a
> > finite set of legacy encodings that's relevant to the world,
> > and their relevance is going to go down and down with time, not
> > up.
>
> Oh! So you consider Japanese, Chinese, Korean, etc. languages
> relevant for programs sitting on my machines? How can you decide
> this? Why being so ignorant and trying to write an standard
> conform library and then pick out a list of char sets of your
> choice which may be possible on iconv, neglecting wishes and
> need of any musl user.
>
> ... or in other words, if you really be this ignorant and
> insist on including those charsets fixed in musl, musl is never
> more for me :( ... I don't need to bring in any part of mine into
> musl, but I don't consider a lib usable for my needs, which
> include several char set files in statical build and neglects to
> load seldom used charset definitions from extern in any way.

They are not going to be "fixed": just don't build them. It is not hard
with musl; just add this to your build script.

One of the nice features of musl is that it appeals to a broader
audience than just "embedded", so it is always going to have stuff you
can cut out if you want absolute minimalism; but this also means it
will get wider usage.

Adding external files has many disadvantages for other people. If you
don't want these conversions, external files do not help you.

Making software for more than one person involves compromises, so
please calm down a bit. Use your own embedded build with the parts you
don't need omitted.

Justin

> >
> > > > Do I want to give users who have large volumes of legacy
> > > > text in their languages stored in these encodings the same
> > > > respect and dignity as users of other legacy encodings we
> > > > already support? Yes.
> > >
> > > Of course. I won't dictate others which conversions they want
> > > to use. I only hat to have plenty of conversion tables on my
> > > system when I really know I never use such kind of
> > > conversions.
> >
> > And your table for just Chinese is as large as all our tables
> > combined...
>
> How can you tell this. I don't think so. Such conversion codes
> may be very compact. Size is mainly required for translation
> tables, that is when code points of the char sets does not match
> Unicode character order, but you always need the space for those
> translations. The rest won't be much.
>
> > I agree you can make iconv smaller than musl's in the case
> > where _no_ legacy DBCS are installed. But if you have just one,
> > you'll be just as large or larger than musl with them all.
>
> ... musl with them all? I don't consider them smaller than an
> optimized byte code interpreter ... not when you are going to
> include DBCS char sets fixed into musl. At least if you do all
> the required translations.
>
> > compare the size of musl's tables to glibc's converters. I've
> > worked hard to make them as small as reasonably possible
> > without doing hideous hacks like decompression into an
> > in-memory buffer, which would actually increase bloat.
>
> Are you now going to build a lib for startup purpose and embedded
> systems only or are you trying to write a general purpose
> library? Including all those definitions in a statical build is
> definitely not the way I will ever like. This may be done for
> some special situations and selected char sets, but not for a
> general purpose library, claiming to get a wide usage.
>
> > If you have root or want to setup nonstandard environment
> > variables.
>
> What about a charset searchpath including something like
> "~/.local/share/charset". This would allow to install charset
> files in the users directory.
>
> > > interpreter allows to statical link in the conversion byte
> > > code programs.
> >
> > At several times the size of the current code/tables, and after
> > the user searches through the documentation to figure out how
> > to do it.
>
> You definitely consider to include all those code tables
> statically into musl? I won't include much more than some
> standard sets. Why don't you want to load the charset definitions
> as they are required?
>
> On one hand you say "use dietlibc" if you need small statical
> programs and on the other hand you want to include many charset
> definitions into a statical build to avoid dynamic loading of
> tables, required only on embedded systems.
>
> So what's the purpose of musl? I don't think you stay right here.
>
> > It's not just a matter of dropping in. You'd have path searches
> > to modify or disable, build options to get the static tables
> > turned on, and all of this stuff would have to be integrated
> > with the build system for what you're dropping it into.
>
> I don't see the required complexity. In fact I won't have a lib
> that includes several charset definitions in a statical build. I
> really like to have a directory with definition files for those
> char sets and don't see the complexity for this you proclamate.
>
> Inclusion in statical build is not more than selection of the
> charsets you want o be included statically. This selection is
> always required or you include all files , which I definitly
> neglect.
>
> > Complexity is never the solution. Honestly, I would take a 1mb
> > increase in binary size over this kind of complexity any day.
> > Thankfully, we don't have to make such a tradeoff.
>
> The only complexity which we has here is the complexity of
> charset translation. The rest is relatively simple.
>
> > Charsets are not added. The time of charsets is over. It should
> > have been over in 1992, when Pike and Thompson made them
> > obsolete, but it's really over now.
>
> So why are you adding Japanese, Chinese and Korean charsets to an
> iconv conversion in musl? Why not just using UTF-8? Whenever you
> use iconv you want the flexibility to do all required charset
> conversions. Which means you need to statically link in many
> charset definitions or you need to dynamically load what is
> required.
>
> > Then dynamic link it. If you want an extensible binary, you use
> > dynamic linking.
>
> Dynamic linking of mail client, ok and where go the charset
> definition files? Are they all packed into your libc.so? That is
> a very big file? Why do I need to have Asian language definition
> on my disk, when I do not want?
>
> It is your decision, but please state clear what purpose you are
> building musl. Here it looks you are mixing things and steping in
> a direction I will never like.
>
> --
> Rich



* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  3:39       ` Rich Felker
@ 2013-08-05  7:53         ` Harald Becker
  2013-08-05  8:24           ` Justin Cormack
  2013-08-05 14:35           ` Rich Felker
  0 siblings, 2 replies; 18+ messages in thread
From: Harald Becker @ 2013-08-05  7:53 UTC (permalink / raw)
  Cc: musl, dalias

Hi Rich !

> iconv is not something that needs to be extensible. There is a
> finite set of legacy encodings that's relevant to the world,
> and their relevance is going to go down and down with time, not
> up.

Oh! So you consider Japanese, Chinese, Korean, etc. languages
relevant for the programs sitting on my machines? How can you decide
this? Why be so ignorant: you try to write a standards-conforming
library, then pick out a list of charsets of your own choice to be
available in iconv, neglecting the wishes and needs of other musl
users.

... or in other words: if you really are this ignorant and insist on
compiling those charsets into musl, then musl is no longer for me
:( ... I don't need to bring any part of mine into musl, but I don't
consider a lib usable for my needs if it includes several charset
files in a static build and refuses to load seldom-used charset
definitions from external files in any way.

> 
> > > Do I want to give users who have large volumes of legacy
> > > text in their languages stored in these encodings the same
> > > respect and dignity as users of other legacy encodings we
> > > already support? Yes.
> > 
> > Of course. I won't dictate others which conversions they want
> > to use. I only hat to have plenty of conversion tables on my
> > system when I really know I never use such kind of
> > conversions.
> 
> And your table for just Chinese is as large as all our tables
> combined...

How can you tell? I don't think so. Such conversion code can be very
compact. Size is mainly required for the translation tables, that is,
when the code points of the charsets do not match Unicode character
order; but you always need the space for those translations. The rest
won't be much.

> I agree you can make iconv smaller than musl's in the case
> where _no_ legacy DBCS are installed. But if you have just one,
> you'll be just as large or larger than musl with them all.

... musl with them all? I don't consider that smaller than an
optimized byte code interpreter ... not when you compile DBCS
charsets into musl, at least not if you do all the required
translations.

> compare the size of musl's tables to glibc's converters. I've
> worked hard to make them as small as reasonably possible
> without doing hideous hacks like decompression into an
> in-memory buffer, which would actually increase bloat.

Are you building a library only for startup code and embedded
systems, or are you trying to write a general purpose library?
Including all those definitions in a static build is definitely
not a way I will ever like. This may be done for some special
situations and selected charsets, but not for a general purpose
library that aims at wide usage.

> If you have root or want to set up nonstandard environment
> variables.

What about a charset search path including something like
"~/.local/share/charset"? This would allow installing charset
files in the user's home directory.

> > interpreter allows to statical link in the conversion byte
> > code programs.
> 
> At several times the size of the current code/tables, and after
> the user searches through the documentation to figure out how
> to do it.

Do you really intend to include all those code tables statically
in musl? I wouldn't include much more than some standard sets.
Why don't you want to load the charset definitions as they are
required?

On one hand you say "use dietlibc" if you need small static
programs, and on the other hand you want to include many charset
definitions in a static build to avoid dynamic loading of tables,
something that is only required on embedded systems.

So what's the purpose of musl? I don't think your position is
consistent here.
 
> It's not just a matter of dropping in. You'd have path searches
> to modify or disable, build options to get the static tables
> turned on, and all of this stuff would have to be integrated
> with the build system for what you're dropping it into.

I don't see the required complexity. In fact, I don't want a
library that includes several charset definitions in a static
build. I would really like to have a directory with definition
files for those charsets, and I don't see the complexity you
claim this entails.

Inclusion in a static build is no more than selecting the
charsets you want to be included statically. This selection is
always required, or you include all files, which I definitely
reject.

> Complexity is never the solution. Honestly, I would take a 1mb
> increase in binary size over this kind of complexity any day.
> Thankfully, we don't have to make such a tradeoff.

The only complexity we have here is the complexity of the charset
translation itself. The rest is relatively simple.

> Charsets are not added. The time of charsets is over. It should
> have been over in 1992, when Pike and Thompson made them
> obsolete, but it's really over now.

So why are you adding Japanese, Chinese and Korean charsets to
the iconv conversion in musl? Why not just use UTF-8? Whenever
you use iconv you want the flexibility to do all required charset
conversions, which means you either need to statically link in
many charset definitions or you need to dynamically load what is
required.

> Then dynamic link it. If you want an extensible binary, you use
> dynamic linking.

Dynamically link the mail client, OK, but where do the charset
definition files go? Are they all packed into your libc.so?
Doesn't that make a very big file? Why do I need to have Asian
language definitions on my disk when I don't want them?

It is your decision, but please state clearly for what purpose
you are building musl. Here it looks like you are mixing things
up and stepping in a direction I will never like.

--
Harald


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  3:13       ` Szabolcs Nagy
@ 2013-08-05  7:03         ` Harald Becker
  2013-08-05 12:54           ` Rich Felker
  0 siblings, 1 reply; 18+ messages in thread
From: Harald Becker @ 2013-08-05  7:03 UTC (permalink / raw)
  Cc: musl, nsz

Hi !

05-08-2013 05:13 Szabolcs Nagy <nsz@port70.net>:

> * Harald Becker <ralda@gmx.de> [2013-08-05 03:24:52 +0200]:
> > iconv then shall:
> > - look for some fixed charsets like ASCII, Latin-1, UTF-8,
> > etc.
> > - search the table of charsets linked into libc
> > - search the table of charsets linked into the program
> > - search for the charset on an external search path
> 
> sounds like a lot of extra management cost
> (for libc, application writer and user as well)

This is not so much work. You already need to search for the
character set table to use; that is, you need to search at least
a table of string values to find the pointer to the conversion
table. "Searching a table" in my statement above means just
walking a pointer chain and doing string compares until a
matching character set is found. Not much different from the code
that is required anyway. Doing this twice, to also check a
possible user chain, is just one more helper function call.

The only code that gets a bit bigger is the file system search.
This depends on whether we only try a single location or walk
through a search path list. But this is the cost of the
flexibility to dynamically load character set conversions (which
I would really prefer for seldom-used charsets).

... and for the application writer it is only more work if he
wants to add charset tables to his program which are not in the
static libc.

The problem is that all tables in libc need to be linked into
your program if you use iconv. So each added charset conversion
increases the size of your program ... and I definitely won't
include Japanese, Chinese or Korean charsets in my programs. Not
that I ignore those people's needs; I just won't need them, so I
don't like adding those conversions to the programs sitting on my
disk.

> it would be nice if the compiler could figure out
> at build time (eg with lto) which tables are used
> but i guess charsets often only known at runtime

How do you want to do this? How shall the compiler know which
charsets the user may use at run time? The only way to select the
charset tables to include in your program is to assume ahead of
time which tables might be used. That is part of the
configuration of the musl build or the application build.

> > [Addendum after thinking a bit more: The byte code conversion
> > files shall exist of a small statical header, followed by the
> > byte code program. The header shall contain the charset name,
> > version of required virtual machine and length of byte code.
> > So you need only add all such conversion files to a big array
> > of bytes and add a Null header to mark the end of table. Then
> > you only need the start of the array and you are able to
> > search through for a specific charset. The iconv function in
> > libc contains a definition for an "unsigned char const
> > *iconv_user_charsets = NULL;", which is linked in, when the
> > user does not provide it's own definition. So iconv can
> > search all linked in charset definitions, and need no code
> > changes. Really simple configuration to select charsets to
> > build in.]
> > 
> 
> yes that can work, but it's a musl specific hack
> that the application programmer needs to take care of

Extra work only has to be done if the application programmer
wants to add a charset that is not in libc to a statically built
program. That gives some more flexibility. If you don't care, you
get musl's built-in list of charsets.
 
> > > if the format changes then dynamic linking is
> > > problematic as well: you cannot update libc
> > > in a single atomic operation
> > 
> > The byte code shall be independent of dynamic linking. The
> > conversion files are only streams of bytes, which shall also
> > be architecture independent. So you do only need to update the
> > conversion files if the virtual machine definition of iconv
> > has been changed (shall not be done much). External files may
> > be read into malloc-ed buffers or mmap-ed, not linked in by
> > the dynamical linker.
> > 
> 
> that does not solve the format change problem
> you cannot update libc without race
> (unless you first replace the .so which supports
> the old format as well as the new one, but then
> libc has to support all previous formats)

If the definition of the iconv virtual state machine is modified,
you need to take extra care on update (delete the old charset
files, install the new lib, install the new charset files,
restart the system) ... but this is only required on a major
update. As soon as the virtual machine definition has stabilized,
you do not need to change the charset definition files at all; or
you just update your lib, then install any new charset files.
After an initial phase of testing, it should happen relatively
seldom that the virtual machine definition needs to be changed in
an incompatible manner. And simply extending the virtual machine
does not invalidate the old charset files.

> it's probably easy to design a fixed format to
> avoid this

A fixed format? For what? Do you know the differences between
charsets, especially multi-byte charsets?

> it seems somewhat similar to the timezone problem
> except zoneinfo is maintained outside of libc so
> there is not much choice, but there are the same
> issues: updating it should be done carefully,
> setuid programs must be handled specially etc

Again: as soon as the virtual machine definition has reached a
stable state, it should rarely happen that a change invalidates a
charset definition file. That is, old files will at least
continue to work with newer lib versions. So there is no problem
on update; just update your lib, then update your charset files.
The only problem arises if a still-running application uses a new
charset file with an old version of the lib. This will be
detected and leads to a failure code from iconv, so you need to
restart your application ... which is always a good idea anyway
after you have updated your lib.

--
Harald


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 16:51 Rich Felker
  2013-08-04 22:39 ` Harald Becker
  2013-08-05  0:46 ` Harald Becker
@ 2013-08-05  5:00 ` Rich Felker
  2013-08-05  8:28 ` Roy
  3 siblings, 0 replies; 18+ messages in thread
From: Rich Felker @ 2013-08-05  5:00 UTC (permalink / raw)
  To: musl

On Sun, Aug 04, 2013 at 12:51:52PM -0400, Rich Felker wrote:
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
> 
> Korean:
> CP949
> Lead byte range is extended to 81-FD (125)
> Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
> 44500 bytes table space
> 
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
> 
> The big remaining questions are:
> 
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).
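The lead/tail byte ranges quoted above can be expressed as a small classification sketch. This is purely illustrative (the function names are mine, not musl's iconv code), but it reproduces the 44500-byte table figure:

```c
/* CP949 byte classification, per the ranges quoted above:
 * lead bytes 81-FD (125 values); tail bytes 41-5A, 61-7A, 81-FE
 * (26+26+126 = 178 values).  Illustrative sketch only. */
static int cp949_is_lead(unsigned int b)
{
    return b >= 0x81 && b <= 0xFD;
}

static int cp949_is_tail(unsigned int b)
{
    return (b >= 0x41 && b <= 0x5A) ||
           (b >= 0x61 && b <= 0x7A) ||
           (b >= 0x81 && b <= 0xFE);
}

/* 125 leads * 178 tails, at 2 bytes per mapped code point,
 * gives the 44500 bytes of table space cited above. */
static int cp949_table_bytes(void)
{
    int leads = 0, tails = 0;
    for (unsigned int b = 0; b < 256; b++) {
        leads += cp949_is_lead(b);
        tails += cp949_is_tail(b);
    }
    return leads * tails * 2;
}
```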

For what it's worth, there is no IANA charset registration for any
supplement to Korean. See the table here:

http://www.iana.org/assignments/character-sets/character-sets.xhtml

The only entries for Korean are ISO-2022-KR and EUC-KR.

Big5-HKSCS however is registered. This matches my intuition that, of
the two, HKSCS would be more important to real-world usage than Korean
extensions.

If we were to omit CP949 and just go with KS X 1001, but include
HKSCS, the total size (minus a minimal amount of code needed) would be
17484+37366 = 54850.

With both supported, it would be 44500+37366 = 81866.

With just KS X 1001 and base Big5, it would be 17484+27946 = 45430.
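The three totals can be re-derived as compile-time constants (all byte counts are the figures quoted in this message):

```c
/* Table-space totals for the three options discussed above. */
enum {
    KSX1001_BYTES = 17484,   /* KS X 1001                   */
    CP949_BYTES   = 44500,   /* CP949 (KS X 1001 extended)  */
    BIG5_BYTES    = 27946,   /* base Big5                   */
    HKSCS_BYTES   = 37366,   /* Big5-HKSCS 16-bit table     */

    OPT_KSX_HKSCS   = KSX1001_BYTES + HKSCS_BYTES,  /* 54850 */
    OPT_CP949_HKSCS = CP949_BYTES   + HKSCS_BYTES,  /* 81866 */
    OPT_KSX_BIG5    = KSX1001_BYTES + BIG5_BYTES    /* 45430 */
};
```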

Being that HKSCS is a standard, registered MIME charset and the cost
is only 10k, and that it seems necessary for real world usage in Hong
Kong, I think it's pretty obvious that we should support it. So I
think the question we're left with is whether the CP949 (MS encoding)
extension for Korean is important to support. The cost is roughly 37k.

I'm going to keep doing research to see if identifying the characters
added in it sheds any light on whether there are important additions.
Obviously I would like to be able to exclude it but I don't want this
decision to be made unfairly based on my bias when it comes to bloat.
:)

Rich


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  1:53     ` Harald Becker
@ 2013-08-05  3:39       ` Rich Felker
  2013-08-05  7:53         ` Harald Becker
  0 siblings, 1 reply; 18+ messages in thread
From: Rich Felker @ 2013-08-05  3:39 UTC (permalink / raw)
  To: musl

On Mon, Aug 05, 2013 at 03:53:12AM +0200, Harald Becker wrote:
> Hi Rich !
> 
> 04-08-2013 20:49 Rich Felker <dalias@aerifal.cx>:
> 
> > Do I want to add that size? No, of course not, and that's why
> > I'm hoping (but not optimistic) that there may be a way to
> > elide a good part of the table based on patterns in the Hangul
> > syllables or the possibility that the giant extensions are
> > unimportant.
> 
> I think there is a way for easy configuration. See other mails,
> they clarify what my intention is.

I saw, and you're free to write such an iconv implementation if you
like, but it's not right for musl. Inventing elaborate mechanisms to
solve simple problems is the glibc way of doing things, not the musl
way.

iconv is not something that needs to be extensible. There is a finite
set of legacy encodings that's relevant to the world, and their
relevance is going to go down and down with time, not up.

> > Do I want to give users who have large volumes of legacy text
> > in their languages stored in these encodings the same respect
> > and dignity as users of other legacy encodings we already
> > support? Yes.
> 
> Of course. I won't dictate to others which conversions they may
> use. I only hate to have plenty of conversion tables on my system
> when I know for certain I will never use such conversions.

And your table for just Chinese is as large as all our tables
combined...

I agree you can make iconv smaller than musl's in the case where _no_
legacy DBCS are installed. But if you have just one, you'll be just as
large or larger than musl with them all. Just compare the size of
musl's tables to glibc's converters. I've worked hard to make them as
small as reasonably possible without doing hideous hacks like
decompression into an in-memory buffer, which would actually increase
bloat.

> ... but
> in case I really need, it can be added dynamically to the running
> system.

If you have root or want to set up nonstandard environment variables.

> > This issue was discussed a long time ago and the consensus
> > among users of static linking was that static linking is most
> > valuable when it makes the binary completely "portable" to
> > arbitrary Linux systems for the same cpu arch, without any
> > dependency on having files in particular locations on the
> > system aside from the minimum required by POSIX (things
> > like /dev/null), the standard Linux /proc mountpoint, and
> > universal config files like /etc/resolv.conf (even that is not
> > necessary, BTW, if you have a DNS on localhost). Having iconv
> > not work without external character tables is essentially a
> > form of dynamic linking, and carries with it issues like where
> > the files are to be found (you can override that with an
> > environment variable, but that can't be permitted for setuid
> > binaries), what happens if the format needs to change and the
> > format on the target machine is not compatible with the libc
> > version your binary was built with, etc. This is also the main
> > reason musl does not support something like nss.
> 
> I see the topic of self-contained linking, and you are right that
> it is required, but it is fully possible to have the best of both
> worlds without much overhead. Writing iconv as a virtual machine

It's not the best of both worlds. It's essentially the same as dynamic
linking.

> interpreter allows to statical link in the conversion byte code
> programs.

At several times the size of the current code/tables, and after the
user searches through the documentation to figure out how to do it.

> > Another side benefit of the current implementation is that it's
> > fully self-contained and independent of any system facilities.
> > It's pure C and can be taken out of musl and dropped in to any
> > program on any C implementation, including freestanding
> > (non-hosted) implementations. If it depended on the filesystem,
> > adapting it for such usage would be a lot more work.
> 
> The virtual machine shall be written in C; I've done this type of
> programming many times. So the resulting code will compile with
> any C compiler, and byte code programs are just arrays of bytes,
> independent of machine byte order. So you will not have any
> further dependencies.

It's not just a matter of dropping in. You'd have path searches to
modify or disable, build options to get the static tables turned on,
and all of this stuff would have to be integrated with the build
system for what you're dropping it into.

Complexity is never the solution. Honestly, I would take a 1mb
increase in binary size over this kind of complexity any day.
Thankfully, we don't have to make such a tradeoff.

> > A fsm implementation would be several times larger than the
> > implementations in iconv.c.
> 
> A bit larger, yes ... but not so much, if virtual machine gets
> designed carefully, and it will not increase in size, when there
> are more charsets get added (only size of byte code program
> added).

Charsets are not added. The time of charsets is over. It should have
been over in 1992, when Pike and Thompson made them obsolete, but it's
really over now.

> > It's possible that we could, at some time in the future,
> > support loading of user-defined character conversion files as
> > an added feature, but this should only be for really
> > special-purpose things like custom encodings used for games or
> > obsolete systems (old Mac, console games, IBM mainframes, etc.).
> 
> We can have it all, with not much overhead. And it is not only
> for such special cases. I don't like to install musl on my
> systems with Japanese, Chinese or Korean conversions, but in case
> I really need, I'm able to throw them in, without much work.
> 
> .... and we can add every character conversion on the fly, without
> rebuild of the library.

Maybe we should also include a bytecode interpreter for doing hostname
lookups, since you might want to do something other than DNS or a
hosts file. And a bytecode interpreter for user database lookups in
place of passwd files. And a bytecode interpreter for adding new
crypt() algorithms. And...

> > In terms of the criteria for what to include in musl itself, my
> > idea is that if you have a mail client or web browser based on
> > iconv for its character set handling, you should be able to
> > read the bulk of content in any language.
> 
> If you are building a mail client or web browser, but what if you
> want to include the possibility of charset conversion but stay at
> small size, just including conversions for only system relevant
> conversions, but not limiting to those. Any other conversion can
> then be added on the fly.

Then dynamic link it. If you want an extensible binary, you use
dynamic linking. The main reason for static linking is when you want a
binary whose behavior does not change with the runtime environment --
for example, for security purposes, for carrying around to other
machines that don't have the same runtime environment, etc.

Rich


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  1:24     ` Harald Becker
@ 2013-08-05  3:13       ` Szabolcs Nagy
  2013-08-05  7:03         ` Harald Becker
  0 siblings, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2013-08-05  3:13 UTC (permalink / raw)
  To: musl

* Harald Becker <ralda@gmx.de> [2013-08-05 03:24:52 +0200]:
> iconv then shall:
> - look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
> - search the table of charsets linked into libc
> - search the table of charsets linked into the program
> - search for the charset on an external search path

sounds like a lot of extra management cost
(for libc, application writer and user as well)

it would be nice if the compiler could figure out
at build time (eg with lto) which tables are used
but i guess charsets often only known at runtime

> [Addendum after thinking a bit more: The byte code conversion
> files shall exist of a small statical header, followed by the
> byte code program. The header shall contain the charset name,
> version of required virtual machine and length of byte code. So
> you need only add all such conversion files to a big array of
> bytes and add a Null header to mark the end of table. Then you
> only need the start of the array and you are able to search
> through for a specific charset. The iconv function in libc
> contains a definition for an "unsigned char const
> *iconv_user_charsets = NULL;", which is linked in, when the user
> does not provide it's own definition. So iconv can search all
> linked in charset definitions, and need no code changes. Really
> simple configuration to select charsets to build in.]
> 

yes that can work, but it's a musl specific hack
that the application programmer needs to take care of

> > if the format changes then dynamic linking is
> > problematic as well: you cannot update libc
> > in a single atomic operation
> 
> The byte code shall be independent of dynamic linking. The
> conversion files are only streams of bytes, which shall also be
> architecture independent. So you do only need to update the
> conversion files if the virtual machine definition of iconv has
> been changed (shall not be done much). External files may be read
> into malloc-ed buffers or mmap-ed, not linked in by the
> dynamical linker.
> 

that does not solve the format change problem
you cannot update libc without race
(unless you first replace the .so which supports
the old format as well as the new one, but then
libc has to support all previous formats)

it's probably easy to design a fixed format to
avoid this

it seems somewhat similar to the timezone problem
except zoneinfo is maintained outside of libc so
there is not much choice, but there are the same
issues: updating it should be done carefully,
setuid programs must be handled specially etc


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  0:49   ` Rich Felker
@ 2013-08-05  1:53     ` Harald Becker
  2013-08-05  3:39       ` Rich Felker
  0 siblings, 1 reply; 18+ messages in thread
From: Harald Becker @ 2013-08-05  1:53 UTC (permalink / raw)
  Cc: musl, dalias

Hi Rich !

04-08-2013 20:49 Rich Felker <dalias@aerifal.cx>:

> Do I want to add that size? No, of course not, and that's why
> I'm hoping (but not optimistic) that there may be a way to
> elide a good part of the table based on patterns in the Hangul
> syllables or the possibility that the giant extensions are
> unimportant.

I think there is a way for easy configuration. See other mails,
they clarify what my intention is.

> Do I want to give users who have large volumes of legacy text
> in their languages stored in these encodings the same respect
> and dignity as users of other legacy encodings we already
> support? Yes.

Of course. I won't dictate to others which conversions they may
use. I only hate to have plenty of conversion tables on my system
when I know for certain I will never use such conversions. ...
but in case I really need one, it can be added dynamically to the
running system.


> > Why can't we have all these character conversions on a state
> > driven machine which loads its information from an external
> > configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
> 
> This issue was discussed a long time ago and the consensus
> among users of static linking was that static linking is most
> valuable when it makes the binary completely "portable" to
> arbitrary Linux systems for the same cpu arch, without any
> dependency on having files in particular locations on the
> system aside from the minimum required by POSIX (things
> like /dev/null), the standard Linux /proc mountpoint, and
> universal config files like /etc/resolv.conf (even that is not
> necessary, BTW, if you have a DNS on localhost). Having iconv
> not work without external character tables is essentially a
> form of dynamic linking, and carries with it issues like where
> the files are to be found (you can override that with an
> environment variable, but that can't be permitted for setuid
> binaries), what happens if the format needs to change and the
> format on the target machine is not compatible with the libc
> version your binary was built with, etc. This is also the main
> reason musl does not support something like nss.

I see the topic of self-contained linking, and you are right that
it is required, but it is fully possible to have the best of both
worlds without much overhead. Writing iconv as a virtual machine
interpreter allows statically linking in the conversion byte code
programs. Those that are not linked in can be searched for in the
filesystem, and a simple configuration option may disable the
file system search completely, for really small embedded
operation. But besides this, all conversions are the same and may
be freely copied between architectures, or linked statically into
a user program (just put the byte streams of the selected
charsets into a simple C array of bytes).

> Another side benefit of the current implementation is that it's
> fully self-contained and independent of any system facilities.
> It's pure C and can be taken out of musl and dropped in to any
> program on any C implementation, including freestanding
> (non-hosted) implementations. If it depended on the filesystem,
> adapting it for such usage would be a lot more work.

The virtual machine shall be written in C; I've done this type of
programming many times. So the resulting code will compile with
any C compiler, and the byte code programs are just arrays of
bytes, independent of machine byte order. So you will not have
any further dependencies.

> A fsm implementation would be several times larger than the
> implementations in iconv.c.

A bit larger, yes ... but not so much if the virtual machine is
designed carefully, and it will not increase in size when more
charsets get added (only the size of the added byte code programs
is added).


> It's possible that we could, at some time in the future,
> support loading of user-defined character conversion files as
> an added feature, but this should only be for really
> special-purpose things like custom encodings used for games or
> obsolete systems (old Mac, console games, IBM mainframes, etc.).

We can have it all, with not much overhead. And it is not only
for such special cases. I don't like to install musl on my
systems with Japanese, Chinese or Korean conversions, but in case
I really need them, I'm able to throw them in without much work.

... and we can add any character conversion on the fly, without
rebuilding the library.

> In terms of the criteria for what to include in musl itself, my
> idea is that if you have a mail client or web browser based on
> iconv for its character set handling, you should be able to
> read the bulk of content in any language.

Fine if you are building a mail client or web browser, but what
if you want to include the possibility of charset conversion
while staying small, including only the conversions relevant to
the system, without being limited to those? Any other conversion
can then be added on the fly.

--
Harald


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: iconv Korean and Traditional Chinese research so far
  2013-08-05  0:44   ` Szabolcs Nagy
@ 2013-08-05  1:24     ` Harald Becker
  2013-08-05  3:13       ` Szabolcs Nagy
  0 siblings, 1 reply; 18+ messages in thread
From: Harald Becker @ 2013-08-05  1:24 UTC (permalink / raw)
  Cc: musl, nsz

Hi !

05-08-2013 02:44 Szabolcs Nagy <nsz@port70.net>:

> * Harald Becker <ralda@gmx.de> [2013-08-05 00:39:43 +0200]:
> > Why can't we have all these character conversions on a state
> > driven machine which loads its information from an external
> > configuration file? This way we can have any kind of
> > conversion someone likes, by just adding the configuration
> > file for the required Unicode to X and X to Unicode
> > conversions.
> 
> external files provided by libc can work but they
> should be possible to embed into the binary

As far as I know, glibc creates small dynamically linked objects
and loads those when required. This is architecture specific, so
you always need conversion files that correspond to your C
library.

My intention is to write the conversions as machine-independent
byte code, which may be copied between machines of different
architectures. If you need a charset conversion, just add the
charset bytecode to the conversion directory, which may be
configurable (directory name from an environment variable with a
default fallback). There may even be a search path for conversion
files, so conversion files may be installed in different
locations.

> otherwise a static binary is not self-contained
> and you have to move parts of the libc around
> along with the binary and if they are loaded
> from fixed path then it does not work at all
> (permissions, conflicting versions etc)

OK, I see the static linking topic, but this is no problem with
byte code conversion programs. It can easily be handled: just add
all the conversion byte code programs together into a single big
array, with a name and offset table ahead of it, then link that
into your program.

This may be done in two steps:

1) Create a selection file for musl build, and include the
specified charsets in libc.a/.so

2) Select the required charset files and create an .o file to
link into your program.


iconv then shall:
- look for some fixed charsets like ASCII, Latin-1, UTF-8, etc.
- search the table of charsets linked into libc
- search the table of charsets linked into the program
- search for the charset on an external search path

... or do it in the opposite direction and use the first charset
conversion found.

This lookup is usually very cheap, except for the file system
search, so it shall not produce much overhead / bloat.
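A minimal sketch of that lookup chain (hypothetical types and names; this is not actual or proposed musl code):

```c
#include <string.h>

/* Hypothetical entry: a charset name paired with its conversion
 * data, chained so several tables can be searched in turn. */
struct charset_entry {
    const char *name;
    const unsigned char *bytecode;
    const struct charset_entry *next;
};

/* Walk one chain, comparing names until a match is found. */
static const struct charset_entry *
lookup_in(const struct charset_entry *chain, const char *name)
{
    for (; chain; chain = chain->next)
        if (!strcmp(chain->name, name))
            return chain;
    return 0;
}

/* The proposed order: charsets linked into libc first, then
 * charsets linked into the program; a file system search path
 * (not shown) would come last. */
static const struct charset_entry *
find_charset(const struct charset_entry *libc_chain,
             const struct charset_entry *user_chain,
             const char *name)
{
    const struct charset_entry *e = lookup_in(libc_chain, name);
    if (!e)
        e = lookup_in(user_chain, name);
    return e;
}
```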

[Addendum after thinking a bit more: The byte code conversion
files shall consist of a small static header, followed by the
byte code program. The header shall contain the charset name, the
version of the required virtual machine, and the length of the
byte code. So you need only concatenate all such conversion files
into a big array of bytes and add a null header to mark the end
of the table. Then you only need the start of the array and you
are able to search through it for a specific charset. The iconv
function in libc contains a definition for an "unsigned char
const *iconv_user_charsets = NULL;", which is linked in when the
user does not provide his own definition. So iconv can search all
linked-in charset definitions and needs no code changes. Really
simple configuration to select the charsets to build in.]
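A rough C rendering of the header layout sketched in the addendum (a hypothetical format; the field names and widths are my own assumptions, and nothing like this exists in musl):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical header for one bytecode conversion module, as
 * sketched in the addendum: charset name, required VM version,
 * and the length of the byte code that follows.  A header whose
 * name is empty serves as the null header ending a linked-in
 * table. */
struct conv_header {
    char     name[16];     /* NUL-terminated charset name   */
    uint16_t vm_version;   /* VM version required to run it */
    uint32_t code_len;     /* bytes of byte code following  */
};

/* Default definition: no user-provided charsets linked in.  An
 * application would override this with its own packed array. */
const unsigned char *iconv_user_charsets = 0;

/* Scan a packed array of (header, bytecode) records for a
 * charset name; returns the matching header or a null pointer. */
static const struct conv_header *
find_conv(const unsigned char *blob, const char *name)
{
    if (!blob)
        return 0;
    for (;;) {
        const struct conv_header *h =
            (const struct conv_header *)blob;
        if (!h->name[0])        /* null header: end of table */
            return 0;
        if (!strcmp(h->name, name))
            return h;
        blob += sizeof *h + h->code_len;
    }
}
```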

> if the format changes then dynamic linking is
> problematic as well: you cannot update libc
> in a single atomic operation

The byte code is independent of dynamic linking. The conversion
files are only streams of bytes, which are also architecture
independent. So you only need to update the conversion files if
the virtual-machine definition of iconv changes (which should
rarely happen). External files may be read into malloc'd buffers
or mmap'd, not linked in by the dynamic linker.

--
Harald


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 22:39 ` Harald Becker
  2013-08-05  0:44   ` Szabolcs Nagy
@ 2013-08-05  0:49   ` Rich Felker
  2013-08-05  1:53     ` Harald Becker
  1 sibling, 1 reply; 18+ messages in thread
From: Rich Felker @ 2013-08-05  0:49 UTC (permalink / raw)
  To: musl

On Mon, Aug 05, 2013 at 12:39:43AM +0200, Harald Becker wrote:
> Hi Rich !
> 
> > Worst-case, adding Korean and Traditional Chinese tables will
> > roughly double the size of iconv.o to around 150k. This will
> > noticeably enlarge libc.so, but will make no difference to
> > static-linked programs except those using iconv. I'm hoping we
> > can make these additions less expensive, but I don't see a good
> > way yet.
> 
> Oh nooo, do you really want to add this statically to the iconv
> version?

Do I want to add that size? No, of course not, and that's why I'm
hoping (but not optimistic) that there may be a way to elide a good
part of the table based on patterns in the Hangul syllables or the
possibility that the giant extensions are unimportant.

Do I want to give users who have large volumes of legacy text in their
languages stored in these encodings the same respect and dignity as
users of other legacy encodings we already support? Yes.

> Why can't we have all these character conversions in a
> state-driven machine which loads its information from an
> external configuration file? This way we can have any kind of
> conversion someone likes, just by adding the configuration file
> for the required Unicode-to-X and X-to-Unicode conversions.

This issue was discussed a long time ago and the consensus among users
of static linking was that static linking is most valuable when it
makes the binary completely "portable" to arbitrary Linux systems for
the same cpu arch, without any dependency on having files in
particular locations on the system aside from the minimum required by
POSIX (things like /dev/null), the standard Linux /proc mountpoint,
and universal config files like /etc/resolv.conf (even that is not
necessary, BTW, if you have a DNS server on localhost). Having iconv not work
without external character tables is essentially a form of dynamic
linking, and carries with it issues like where the files are to be
found (you can override that with an environment variable, but that
can't be permitted for setuid binaries), what happens if the format
needs to change and the format on the target machine is not compatible
with the libc version your binary was built with, etc. This is also
the main reason musl does not support something like nss.

Another side benefit of the current implementation is that it's fully
self-contained and independent of any system facilities. It's pure C
and can be taken out of musl and dropped in to any program on any C
implementation, including freestanding (non-hosted) implementations.
If it depended on the filesystem, adapting it for such usage would be
a lot more work.

> State-driven FSM interpreters are really small and fast and can
> read their complete configuration from a file ... an
> architecture-independent file, so we could have the same
> character conversion files for all architectures.

An FSM implementation would be several times larger than the
implementations in iconv.c. It's possible that we could, at some time
in the future, support loading of user-defined character conversion
files as an added feature, but this should only be for really
special-purpose things like custom encodings used for games or
obsolete systems (old Mac, console games, IBM mainframes, etc.).

In terms of the criteria for what to include in musl itself, my idea
is that if you have a mail client or web browser based on iconv for
its character set handling, you should be able to read the bulk of
content in any language.

Rich



* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 16:51 Rich Felker
  2013-08-04 22:39 ` Harald Becker
@ 2013-08-05  0:46 ` Harald Becker
  2013-08-05  5:00 ` Rich Felker
  2013-08-05  8:28 ` Roy
  3 siblings, 0 replies; 18+ messages in thread
From: Harald Becker @ 2013-08-05  0:46 UTC (permalink / raw)
  Cc: musl, dalias

Hi Rich,

in addition to my previous message to clarify some things:

04-08-2013 12:51 Rich Felker <dalias@aerifal.cx>:

> Worst-case, adding Korean and Traditional Chinese tables will
> roughly double the size of iconv.o to around 150k. This will
> noticeably enlarge libc.so, but will make no difference to
> static-linked programs except those using iconv. I'm hoping we
> can make these additions less expensive, but I don't see a good
> way yet.

I would write iconv as a virtual-machine interpreter for a very
simple byte-code machine. The byte code (program) of the virtual
machine is just an array of unsigned bytes, and the virtual
machine contains only the instructions to read the next byte and
assemble a Unicode value, or to receive a Unicode value and
produce multi-byte character output. The virtual-machine code
itself works like a finite state machine, which handles
multi-byte character sets.

That way iconv consists of a small byte-code interpreter
implementing the virtual machine. It then maps in the byte code
from an external file for whatever character set is required.
This byte code consists of virtual-machine instructions and
conversion tables. As the virtual machine is optimized for
conversion, converting a character requires interpreting only a
few virtual instructions (for simple character sets; big ones may
need a few more). This is usually very fast, as not much data is
involved and the instructions are highly optimized for the
conversion operation.

The virtual machine works with a data space of only a few bytes
(less than 256), of which some bytes need to be preserved from
one conversion call to the next. That is, a conversion needs a
conversion context of a few bytes (8..16).
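To make the idea concrete, here is a toy sketch of such an interpreter (the opcodes, context layout, and step limit are all invented for illustration; this is not musl's iconv):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy conversion VM. */
enum { OP_HALT, OP_READBYTE, OP_SHIFT8, OP_ADD16, OP_EMIT };

/* Persistent context surviving across calls (the "8..16 bytes"). */
struct vm_ctx { unsigned long acc; };

/* Run byte code over one input buffer, writing code points to out.
 * A step limit guards against misbehaved byte code looping forever. */
static size_t vm_run(const unsigned char *code, struct vm_ctx *ctx,
                     const unsigned char *in, size_t inlen,
                     unsigned long *out, size_t outmax)
{
    size_t ip = 0, i = 0, o = 0, steps = 0;
    for (;;) {
        if (++steps > 10000) return o;   /* bail out on runaway code */
        switch (code[ip++]) {
        case OP_HALT:
            return o;
        case OP_READBYTE:                /* acc = next input byte */
            if (i >= inlen) return o;
            ctx->acc = in[i++];
            break;
        case OP_SHIFT8:                  /* acc = acc<<8 | next byte */
            if (i >= inlen) return o;
            ctx->acc = ctx->acc << 8 | in[i++];
            break;
        case OP_ADD16:                   /* acc += 16-bit immediate */
            ctx->acc += (unsigned long)code[ip] << 8 | code[ip+1];
            ip += 2;
            break;
        case OP_EMIT:                    /* output acc, restart program */
            if (o < outmax) out[o++] = ctx->acc;
            ip = 0;
            break;
        }
    }
}
```

For example, the two-byte program `{ OP_READBYTE, OP_EMIT }` is an identity conversion, and `OP_ADD16` with a table base would express a simple offset mapping; real charsets would index into tables shipped alongside the byte code.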

Independently of which character-set conversions you want to add,
you need only a single byte-code interpreter for iconv, which
will not increase in size. Only the external byte code and
conversion tables for the charsets vary in size. Simple charsets,
like the Latin ones, consist of only a few bytes of byte code;
big charsets like Japanese, Chinese and Korean need some more
byte code and perhaps bigger translation tables ... but those
tables are only loaded when iconv needs to access such a charset.

iconv itself doesn't need to maintain a table of available
charsets; it only converts the charset name into a filename and
opens the corresponding charset translation file. A header and
version check on the charset file would handle possible
installation conflicts. For each conversion request the
virtual-machine interpreter runs through the byte code of the
requested charset and returns the conversion result. As the
virtual machine would contain no operations that can touch the
rest of the system, this should not break system security. At
worst the byte code is so misbehaved that it runs forever without
producing an error or any output, so the machine hangs in an
infinite loop during conversion until the process is terminated
(a simple counter may limit the number of executed instructions
and bail out in case of such looping).

> At some point, especially if the cost is not reduced, I will
> probably add build-time options to exclude a configurable
> subset of the supported character encodings. This would not be
> extremely fine-grained, and the choices to exclude would
> probably be just: Japanese, Simplified Chinese, Traditional
> Chinese, and Korean. Legacy 8-bit might also be an option but
> these are so small I can't think of cases where it would be
> beneficial to omit them (5k for the tables on top of the 2k of
> actual code in iconv). Perhaps if there are cases where iconv
> is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices,
> dropping the 8-bit tables and all of the support code could be
> useful; the resulting iconv would be around 1k, I think.

You could skip all of this if iconv were constructed as a
virtual-machine interpreter and all character conversions were
loaded from external files.

As a fallback the library may compile in the byte code for some
small charset conversions, like ASCII, Latin-1 and UTF-8. All
other charset conversions are loaded from external resources,
which may or may not be installed at the admin's discretion, and
can simply be added later if required.

--
Harald



* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 22:39 ` Harald Becker
@ 2013-08-05  0:44   ` Szabolcs Nagy
  2013-08-05  1:24     ` Harald Becker
  2013-08-05  0:49   ` Rich Felker
  1 sibling, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2013-08-05  0:44 UTC (permalink / raw)
  To: musl

* Harald Becker <ralda@gmx.de> [2013-08-05 00:39:43 +0200]:
> Why can't we have all these character conversions in a
> state-driven machine which loads its information from an
> external configuration file? This way we can have any kind of
> conversion someone likes, just by adding the configuration file
> for the required Unicode-to-X and X-to-Unicode conversions.

external files provided by libc can work, but it
should be possible to embed them into the binary

otherwise a static binary is not self-contained
and you have to move parts of the libc around
along with the binary; and if they are loaded
from a fixed path then it does not work at all
(permissions, conflicting versions, etc.)

if the format changes then dynamic linking is
problematic as well: you cannot update libc
in a single atomic operation



* Re: iconv Korean and Traditional Chinese research so far
  2013-08-04 16:51 Rich Felker
@ 2013-08-04 22:39 ` Harald Becker
  2013-08-05  0:44   ` Szabolcs Nagy
  2013-08-05  0:49   ` Rich Felker
  2013-08-05  0:46 ` Harald Becker
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 18+ messages in thread
From: Harald Becker @ 2013-08-04 22:39 UTC (permalink / raw)
  Cc: musl, dalias

Hi Rich !

> Worst-case, adding Korean and Traditional Chinese tables will
> roughly double the size of iconv.o to around 150k. This will
> noticeably enlarge libc.so, but will make no difference to
> static-linked programs except those using iconv. I'm hoping we
> can make these additions less expensive, but I don't see a good
> way yet.

Oh nooo, do you really want to add this statically to the iconv
version?

Why can't we have all these character conversions in a
state-driven machine which loads its information from an external
configuration file? This way we can have any kind of conversion
someone likes, just by adding the configuration file for the
required Unicode-to-X and X-to-Unicode conversions.

State-driven FSM interpreters are really small and fast and can
read their complete configuration from a file ... an
architecture-independent file, so we could have the same
character conversion files for all architectures.

> At some point, especially if the cost is not reduced, I will
> probably add build-time options to exclude a configurable
> subset of the supported character encodings.

All of this would go away if you did not load character
conversions from a static table. Why not consider loading a
conversion file for a given character set from a predefined or
configurable directory, with the name of the character set as the
filename? If you want the file to be in a directly
readable/modifiable form, you need to add a minimalistic parser;
otherwise the file contents may be treated as binary data, and
you can just fread or mmap the file and use the data to control
the character-set conversion. Most conversions need only minimal
space; only some require bigger conversion tables. ... and those
who dislike this simply don't need to install the conversion
files they do not want.
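A sketch of the mmap approach (the directory, file extension, and function name are made up; error handling is kept minimal):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a conversion file named after the charset from a fixed
 * directory. Path and naming scheme are hypothetical. */
static const unsigned char *map_charset(const char *name, size_t *len)
{
    char path[256];
    if (snprintf(path, sizeof path, "/usr/share/iconv/%s.cct", name)
            >= (int)sizeof path)
        return NULL;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return NULL; }
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid */
    if (p == MAP_FAILED) return NULL;
    *len = st.st_size;
    return p;
}
```

Because the file is mapped read-only and shared from the page cache, many processes converting the same charset pay for the table only once.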

--
Harald



* iconv Korean and Traditional Chinese research so far
@ 2013-08-04 16:51 Rich Felker
  2013-08-04 22:39 ` Harald Becker
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Rich Felker @ 2013-08-04 16:51 UTC (permalink / raw)
  To: musl

OK, so here's what I've found so far. Both legacy Korean and legacy
Traditional Chinese encodings have essentially a single base character
set:

Korean:
KS X 1001 (previously known as KS C 5601)
93 x 94 DBCS grid (A1-FD A1-FE)
All characters in BMP
17484 bytes table space

Traditional Chinese:
Big5 (CP950)
89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
All characters in BMP
27946 bytes table space

Both of these have various minor extensions, but the main extensions
of any relevance seem to be:

Korean:
CP949
Lead byte range is extended to 81-FD (125)
Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
44500 bytes table space

Traditional Chinese:
HKSCS (CP951)
Lead byte range is extended to 88-FE (119)
1651 characters outside BMP
37366 bytes table space for 16-bit mapping table, plus extra mapping
needed for characters outside BMP
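The table-space figures quoted above follow directly from the grid dimensions, at two bytes per cell:

```c
#include <assert.h>

/* Table space = lead-byte range * tail-byte range * 2 bytes/cell. */
static int table_bytes(int leads, int tails)
{
    return leads * tails * 2;
}
```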

The big remaining questions are:

1. How important are these extensions? I would guess the answer is
"fairly important", especially for HKSCS, where I believe the
additional characters are needed for encoding Cantonese words, but
it's less clear to me whether the Korean extensions are useful
(they seem to exist mainly for completeness, representing most/all
theoretically possible syllables that don't actually occur in
words, but this may be a naive misunderstanding on my part).

2. Are there patterns to exploit? For Korean, ALL of the Hangul
characters are actually combinations of several base letters.
Unicode encodes them all sequentially in a pattern where the
conversion to their constituent letters is purely algorithmic,
but there seems to be no clean pattern in the legacy encodings,
as they started out encoding just the "important" combinations,
then added less important ones in separate ranges.
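The algorithmic Unicode mapping is the standard Hangul syllable arithmetic: a syllable is U+AC00 plus a mixed-radix index over 19 lead consonants, 21 vowels, and 28 tails (0 meaning no tail), giving 11172 code points:

```c
#include <assert.h>

/* S = 0xAC00 + (L*21 + V)*28 + T, per the Unicode Standard. */
static unsigned hangul_compose(int l, int v, int t)
{
    return 0xAC00 + (l * 21 + v) * 28 + t;
}

static void hangul_decompose(unsigned s, int *l, int *v, int *t)
{
    unsigned i = s - 0xAC00;
    *l = i / (21 * 28);
    *v = i / 28 % 21;
    *t = i % 28;
}
```

The legacy encodings lack any such closed form, which is why they need lookup tables.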

Worst-case, adding Korean and Traditional Chinese tables will roughly
double the size of iconv.o to around 150k. This will noticeably enlarge
libc.so, but will make no difference to static-linked programs except
those using iconv. I'm hoping we can make these additions less
expensive, but I don't see a good way yet.

At some point, especially if the cost is not reduced, I will probably
add build-time options to exclude a configurable subset of the
supported character encodings. This would not be extremely
fine-grained, and the choices to exclude would probably be just:
Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
8-bit might also be an option but these are so small I can't think of
cases where it would be beneficial to omit them (5k for the tables on
top of the 2k of actual code in iconv). Perhaps if there are cases
where iconv is needed purely for conversion between different Unicode
forms, but no legacy charsets, on tiny embedded devices, dropping the
8-bit tables and all of the support code could be useful; the
resulting iconv would be around 1k, I think.

Rich



end of thread, other threads:[~2013-08-05 14:43 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20130804232816.dc30d64f61e5ec441c34ffd4f788e58e.313eb9eea8.wbe@email22.secureserver.net>
2013-08-05 12:46 ` iconv Korean and Traditional Chinese research so far Rich Felker
2013-08-04 16:51 Rich Felker
2013-08-04 22:39 ` Harald Becker
2013-08-05  0:44   ` Szabolcs Nagy
2013-08-05  1:24     ` Harald Becker
2013-08-05  3:13       ` Szabolcs Nagy
2013-08-05  7:03         ` Harald Becker
2013-08-05 12:54           ` Rich Felker
2013-08-05  0:49   ` Rich Felker
2013-08-05  1:53     ` Harald Becker
2013-08-05  3:39       ` Rich Felker
2013-08-05  7:53         ` Harald Becker
2013-08-05  8:24           ` Justin Cormack
2013-08-05 14:43             ` Rich Felker
2013-08-05 14:35           ` Rich Felker
2013-08-05  0:46 ` Harald Becker
2013-08-05  5:00 ` Rich Felker
2013-08-05  8:28 ` Roy

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
