mailing list of musl libc
From: Roy <roytam@gmail.com>
To: musl@lists.openwall.com
Subject: Re: Re: Re: Re: iconv Korean and Traditional Chinese research so far
Date: Wed, 07 Aug 2013 08:54:35 +0800
Message-ID: <op.w1e8s9jtdyj81a@monster.itedn32a.localdomain>
In-Reply-To: <20130806162214.GX221@brightrain.aerifal.cx>

On Wed, 07 Aug 2013 00:22:15 +0800, Rich Felker <dalias@aerifal.cx> wrote:

> On Tue, Aug 06, 2013 at 11:11:23PM +0800, Roy wrote:
>> >However, based on the file at
>> >
>> >http://moztw.org/docs/big5/table/uao250-b2u.txt
>> >
>> >a number of the mappings UAO defines are into the private use area.
>> >This would generally preclude support (as this is a font-specific
>> >encoding, not a Unicode encoding) unless the affected characters have
>> >since been added to Unicode and could be remapped to the correct
>> >codepoints. Do you know the status on this?
>>
>> Those are Big5-2003 compatibility code range. Big5-2003 is in
>> CNS11643 appendix section, but it is rarely used since no
>> OS/Application supports it.
>> So skipping the PUA mappings are fine.
>
> OK, a few more questions...
>
> 1. What, if anything, is the accepted charset name for Big5-UAO, i.e.
> how would it appear in MIME headers, etc.?

There is none. All Big5 variants actually use "big5".

>
> 2. Can you give me an idea of the relationship between the Big5
> variants/extensions/supersets? I'm aware of Windows CP950, HKSCS, and
> now UAO. Is CP950 a common subset of them all, or is there a smaller
> base subset "plain Big5" that's the only shared part? What is ETEN and
> how does it fit in?

MS-CP950 can be considered a common subset of HKSCS, UAO, ETEN, etc.

Big5-ETEN mostly looks like CP950, but with additional areas such as
Japanese Katakana/Hiragana.

>
> 3. How should different MIME charset names be handled? In particular,
> what does plain "Big5" refer to? Should it be interpreted as CP950?

Since they all use the same MIME name, it depends on the system codepage.
Some Hong Kong news websites still use Big5-HKSCS. For people using
Internet Explorer with HKSCS installed, the "big5" MIME name maps to
Big5-HKSCS (that is, the only CP950 entry is mapped to CP951.nls, which is
HKSCS).
Firefox users have to choose Big5-HKSCS by hand, or use an extension
that checks the domain name.
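
The point above is that a bare "big5" label cannot be resolved from the
label alone, so the consumer has to pick a variant from local
configuration. A minimal shell sketch of that idea follows. The function
name resolve_big5_label and the BIG5_DEFAULT variable are hypothetical,
invented for illustration, and it assumes the local iconv knows the
charset names CP950 and BIG5-HKSCS.

```shell
# Hypothetical helper: map a MIME charset label to an iconv charset name.
# BIG5_DEFAULT is an assumed, site-chosen setting, not an existing variable.
resolve_big5_label() {
    # Charset labels are case-insensitive, so compare in lower case.
    label=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$label" in
        # Ambiguous label: fall back to the configured variant, else CP950.
        big5)       printf '%s\n' "${BIG5_DEFAULT:-CP950}" ;;
        # Explicit label: no guessing needed.
        big5-hkscs) printf '%s\n' "BIG5-HKSCS" ;;
        # Anything else is passed through unchanged.
        *)          printf '%s\n' "$1" ;;
    esac
}
```

It could then be used as, for example,
iconv -f "$(resolve_big5_label big5)" -t UTF-8 page.html
with BIG5_DEFAULT=BIG5-HKSCS set on systems that mostly read Hong Kong
content.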

>
> 4. Is there anywhere to get clean semi-authoritative sources for the
> definitions of these charsets in plain text form. For HKSCS I found a
> government PDF file but it's useless because you can't extract the
> data in any meaningful way. Unicode has the CP950 file and "BIG5"
> file, but the latter refers to Unicode 1.1 in the comments and I've
> heard claims that it's completely wrong on many issues. Unihan.txt is
> also fairly useless because it only defines the mappings for
> ideographic characters, not the rest of the mappings in legacy CJK
> encodings. Short of anything better I may just have to use glibc
> output as a reference...

There is documentation created by the Mozilla Taiwan community:
http://moztw.org/docs/big5/
Google Translate:
http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=zh-TW&ie=UTF-8&u=http%3A%2F%2Fmoztw.org%2Fdocs%2Fbig5%2F

>
>> >I'm also still unclear on whether this is a superset of HKSCS (it's
>> >definitely not directly, but maybe it is if the PUA mappings are
>> >corrected; I did not do any detailed checks but just noted the lack of
>> >mappings to the non-BMP codepoints HKSCS uses).
>>
>> No it isn't. There is some code conflict between HKSCS(2001/2004) and  
>> UAO.
>
> Some conflict or heavy conflict? From an implementation standpoint, I
> want to know if this is something where they could use a common table
> plus "if (type==BIG5UAO) { /* fixups here */ ... }" or if they need
> completely separate tables.

Big5-HKSCS 2004 map for reference:
http://moztw.org/docs/big5/table/hkscs2004.txt
Use sed and awk to create a b2u.txt for comparison:
$ sed -e '/^==/d' -e '1,2d' hkscs2004.txt |
    awk 'BEGIN{print "# big5 unicode"}{print "0x" $1 " 0x" $4}' > hkscs2004-b2u.txt
Result:
http://roy.dnsd.me/hkscs2004-b2u.txt

And finally the diff:
http://roy.dnsd.me/uao250-hkscs2004.diff

The diff is huge, so separate tables are needed.
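
To put a number on the conflict instead of reading the diff by eye, a
rough sketch like the following could count the Big5 codes that both
tables define but map to different Unicode values. It assumes both files
use the "0xBIG5 0xUNICODE" layout with "#" comment lines, as produced by
the sed/awk command above, and that the first file is not empty. The
function name count_conflicts is made up for this example.

```shell
# Count Big5 codes present in both tables whose Unicode mappings differ.
# Usage: count_conflicts FIRST-b2u.txt SECOND-b2u.txt
count_conflicts() {
    awk '
        /^#/ || NF < 2 { next }    # skip comment lines and blank lines
        # First file: remember each mapping (upper-cased so hex case is ignored).
        NR == FNR { first[toupper($1)] = toupper($2); next }
        # Second file: same Big5 code, different Unicode value.
        (toupper($1) in first) && first[toupper($1)] != toupper($2) { conflicts++ }
        END { print conflicts + 0 }
    ' "$1" "$2"
}
```

For example:
count_conflicts uao250-b2u.txt hkscs2004-b2u.txt
Codes that exist in only one of the two tables are not counted, since
those could in principle be handled by a common table plus fixups.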

>
> Rich



