mailing list of musl libc
From: Roy <roytam@gmail.com>
To: musl@lists.openwall.com
Subject: Re: Re: Re: iconv Korean and Traditional Chinese research so far
Date: Tue, 06 Aug 2013 23:11:23 +0800
Message-ID: <op.w1ehs9g7dyj81a@monster.itedn32a.localdomain>
In-Reply-To: <20130806133205.GS221@brightrain.aerifal.cx>

On Tue, 06 Aug 2013 21:32:05 +0800, Rich Felker <dalias@aerifal.cx> wrote:

> On Tue, Aug 06, 2013 at 02:14:33PM +0800, Roy wrote:
>> >My impression (please correct me if I'm wrong) is that you can't use
>> >Big5-UAO as the system encoding on modern versions of Windows (just
>> >ancient ones where you install unmaintained third-party software that
>> >hacks the system charset tables)
>>
>> It doesn't "hack" the nls file; it replaces it with a CP950 nls file
>> that includes UAO.
>> The executable (setup program) is generated with NSIS (Nullsoft
>> Scriptable Install System).
>> Since the nls file format hasn't changed from NT 3.1 in 1993 through
>> NT 6.2 (i.e. Win 8.1 "Blue") today, the UAO-enabled CP950 nls file
>> will continue to work in newer versions of Windows unless MS replaces
>> the nls file format with something different.
>
> OK, thanks for clarifying. I'd still consider it a ways into the
> "hack" domain if the OS vendor still is not supporting it directly,
> but it does make a difference that it still works "cleanly". I was
> under the impression that these sorts of things change between
> Windows versions in ways that would preclude using old, unmaintained
> patches like this. I agree that just the fact that certain OS vendors
> do not support an encoding is not in itself a reason not to support
> it.
>
>> >and that it's not supported in GNU
>> >libiconv. If this is the case, and especially if Big5-UAO's main use
>> >is on a telnet-based BBS where everybody is using special telnet
>> >clients that have their own Big5-UAO converters,
>>
>> GNU libiconv doesn't even support IBM EBCDIC (both SBCS and stateful
>> SBCS+DBCS)!
>>
>> So does it matter whether GNU libiconv supports a given encoding?
>> (Yes, glibc iconv (or rather, its gconv modules) does support both
>> IBM EBCDIC SBCS and stateful SBCS+DBCS encodings.)
>
> I was under the impression that GNU libiconv was in sync with glibc's
> iconv, but I have not checked this. I actually was more interested in
> glibc's, which is in widespread use. glibc's inclusion or exclusion of
> a feature is not in itself a reason to include or exclude it, but
> supporting something that glibc supports does have the added
> motivation that it will increase compatibility with what programs are
> expecting.
>
>> >I'd find it really
>> >hard to justify trying to support this. But I'm open to hearing
>> >arguments on why we should, if you believe it's important.
>>
>> I think it would be nice to have a build/link-time option for those
>> "unpopular" encodings.
>>
>> >>For static linking, can we have conditional linking like Qt does?
>> >
>> >My feeling is that it's a tradeoff, and probably has more pros than
>> >cons. Unlike Qt, musl's iconv is extremely small.
>>
>> I would add "right now" here. When we add more encodings later, the
>> iconv module will be bigger than it is now, and people will need a
>> way to conditionally compile in the encodings they need (for both
>> dynamic and static linking).
>
> It's never been my intent to add more encodings later (aside from pure
> non-table-based variants of existing ones, like the ISO-2022 versions)
> once coverage is complete, at least not as built-in features. This can
> be discussed if you think there are reasons it needs to change, but up
> until now, the plan has been to support:
>
> - ISO-8859 based 8-bit encodings
> - Other 8-bit encodings with actual legacy usage (mainly Cyrillic)
> - JIS 0208 based encodings
> - KS X 1001 based encodings
> - GB 2312 and supersets
> - Big5 and supersets
>
> All of those except Big5 and supersets are now supported, so short of
> any change, my position is that right now we're discussing the "last"
> significant addition to musl's iconv.
>
> Some things that are definitely outside the scope of musl's iconv:
>
> - Anything whose characters are not present in Unicode
> - Anything PUA-based (really, same as above)
> - Newly invented encodings with no historical encoded data
>
> What's more borderline is where UAO falls: encodings that have neither
> governmental or language-body-authority support nor any vendor support
> from other software vendors, but for which there is at least one major
> corpus of historical data and/or current usage for the encoding by
> users of the language(s) whose characters are encoded.
>
> However, based on the file at
>
> http://moztw.org/docs/big5/table/uao250-b2u.txt
>
> a number of the mappings UAO defines are into the private use area.
> This would generally preclude support (as this is a font-specific
> encoding, not a Unicode encoding) unless the affected characters have
> since been added to Unicode and could be remapped to the correct
> codepoints. Do you know the status on this?

Those fall in the Big5-2003 compatibility code range. Big5-2003 is in an
appendix of CNS 11643, but it is rarely used since no OS or application
supports it.
So skipping the PUA mappings is fine.
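
To make "skipping the PUA mappings" concrete, here is a rough sketch of filtering a byte-to-Unicode table. It assumes each row has the form "0xBBBB 0xUUUU" (I have not verified that uao250-b2u.txt uses exactly this layout), and the names is_pua, parse_b2u and drop_pua are made up for illustration. The private use ranges themselves are the standard Unicode ones.

```python
# Unicode private use ranges: the BMP PUA plus planes 15 and 16.
PUA_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_pua(codepoint):
    """Return True if codepoint lies in a Unicode private use range."""
    return any(lo <= codepoint <= hi for lo, hi in PUA_RANGES)


def parse_b2u(text):
    """Parse rows of the ASSUMED form '0xBBBB 0xUUUU' into (big5, unicode) pairs.

    Blank lines and lines starting with '#' are ignored.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        pairs.append((int(fields[0], 16), int(fields[1], 16)))
    return pairs


def drop_pua(pairs):
    """Keep only the mappings whose target is not a private use code point."""
    return [(b, u) for b, u in pairs if not is_pua(u)]


# Illustrative sample rows only; not copied from the UAO table.
SAMPLE = """\
# big5 unicode
0xA440 0x4E00
0x8140 0xE000
0x8141 0xF8FF
"""

kept = drop_pua(parse_b2u(SAMPLE))
print([(hex(b), hex(u)) for b, u in kept])
```

With a filter like this, the PUA rows simply have no mapping, so a converter would treat those byte sequences as invalid input rather than emitting font-specific code points.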

>
> I'm also still unclear on whether this is a superset of HKSCS (it's
> definitely not directly, but maybe it is if the PUA mappings are
> corrected; I did not do any detailed checks but just noted the lack of
> mappings to the non-BMP codepoints HKSCS uses).

No, it isn't. There are some code conflicts between HKSCS (2001/2004) and UAO.
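
For anyone who wants to check this kind of claim against the published tables, a minimal sketch of the comparison follows. The two dictionaries hold invented placeholder values, not real HKSCS or UAO mappings, and find_conflicts and is_superset are hypothetical helper names. A "conflict" here means a byte code that both tables define but map to different code points.

```python
def find_conflicts(table_a, table_b):
    """Return {code: (a_target, b_target)} for codes both tables map differently."""
    return {
        code: (table_a[code], table_b[code])
        for code in sorted(table_a.keys() & table_b.keys())
        if table_a[code] != table_b[code]
    }


def is_superset(candidate, base):
    """True if candidate defines every code in base with the same target."""
    return all(candidate.get(code) == target for code, target in base.items())


# Placeholder tables with invented values; real data would come from the
# published HKSCS and UAO mapping files.
hkscs_like = {0x8740: 0x1111, 0x8741: 0x2222}
uao_like = {0x8740: 0x1111, 0x8741: 0x3333, 0x8742: 0x4444}

print(find_conflicts(hkscs_like, uao_like))
print(is_superset(uao_like, hkscs_like))
```

A single conflicting code is enough to rule out a superset relationship, which is why one table cannot simply be layered on top of the other.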

>
>> >Even with all the
>> >above, the size of iconv.o will be under 130k, maybe closer to 110k.
>> >If you actually use iconv in your program, this is a small price to
>> >pay for having it fully functional. On the other hand, if linking it
>> >is conditional, you have to consider who makes the decision, and when.
>> >If it's at link time for each application, that's probably too much of
>> >a musl-specific version.
>>
>> Since statically linking libc iconv is a new area (other libcs
>> don't touch this topic much), I think we can create a standard for
>> statically linking specified encoding tables at link time.
>> (This is also a reason for "why libc should provide a unique
>> identifier with a preprocessor define".)
>
> I don't see how "creating a standard" for doing this would make the
> situation any better. Most software authors these days are at best
> tolerant of the existence of static linking, and more often hostile to
> it. They're not going to add specific build behavior for static
> linking, and even if they do, they're likely to get it wrong, in which
> case the user ends up stuck with binaries that can't process input in
> their language.
>
> Rich


