Re: Status of Big5 and extensions

mailing list of musl libc
 help / color / mirror / code / Atom feed

From: Roy <roytam@gmail.com>
To: musl@lists.openwall.com
Subject: Re: Status of Big5 and extensions
Date: Thu, 08 Aug 2013 08:18:45 +0800	[thread overview]
Message-ID: <op.w1g1tjtzdyj81a@monster.itedn32a.localdomain> (raw)
In-Reply-To: <20130807165044.GA14867@brightrain.aerifal.cx>

在 Thu, 08 Aug 2013 00:50:44 +0800, Rich Felker <dalias@aerifal.cx> 寫道:

> OK, so after a lot of research and discussion, I'm about to commit the
> first part of Big5 support in iconv. The plain "Big5" charset name is
> going to be the maximal set everybody agrees on, which, as far as I
> can tell, is just CP950. (Actually even IBM's Big5 variant in ICU
> differs from CP950 in a few places, but it's just wrong. The one
> ideograph where it differs conflicts with Unihan.txt, which is
> authoritative on which Unicode characters encode which character
> identities from historical CJK charsets.)
>
> As for extensions, my understanding of HKSCS is to a point now where I
> feel we can add it (charset name "Big5-HKSCS"), based on the 2008
> government publication which has a few new characters beyond the old
> versions in most software. (Thank you nsz for helping dig up all the
> files and researching how they differ!) However there are a few
> technical difficulties to implementation: the Unicode codepoints span
> a 17-bit range rather than just a 16-bit range, so we need an
> efficient way of doing the mappings. What's worse, several HKSCS
> codepoints map to Latin characters with multiple combining marks which
> have no precomposed representation. Supporting these will require
> extending iconv to be able to output two or more characters of output
> for each unit of input, which is mildly error-prone with the current
> design, so I may hold off on HKSCS support until I overhaul some of
> the core logic of the iconv function.
>
> Now, the hard part: Taiwan extensions. I appreciate all the help from
> Roy, but I'm still not to the point of having anything nearly ready to
> be added. The UAO extension set has at least 290 mappings to
> codepoints in the Unicode Private Use Areas (PUA), which makes it
> unsuitable for inclusion as-is. This may be a resolvable issue if
> these 290 characters all exist in Unicode and could be remapped, but I
> do not feel any of us are qualified to determine this. So UAO is
> probably still a long way away from being able to be adopted into
> iconv, unless we have authoritative data on the identities of these
> characters in the form of something from an official standards body or
> at the very least multiple major vendors (enough to ensure that any
> future standardization would be consistent and non-controversial).
>

I did ask one of creator of UAO in  
http://forum.moztw.org/viewtopic.php?f=10&t=40174
And there is a reply of those PUA characters in Big5 view:
0xFA40 - 0xFA63: Reserved for user-defined characters
0xC8A5 - 0xC8B0: for Big5-2003 compliant
Others: "Art characters" of ChinaSea character set(csw 1.0, i.e.  
cswsmin.tte), which are not mapped to Unicode and the codepoint that has  
not occupied by HKSCS codepoints.

> If there are other supersets of CP950 (possibly subsets of UAO) which
> would be useful for supporting Taiwanese users, I would very much like
> to understand the situation on them and whether it's feasible to
> include them in iconv at this point. Perhaps the most important thing
> to know at this point is what practical difficulties exist for
> Taiwanese users limited to CP950, and which extension charsets could
> solve this. I feel like issues like "I cannot write my own name" are
> on a completely different level than "I can't mix Japanese text in
> with my legacy-encoded data [when I really should be using Unicode for
> this anyway]". The former sort of issue is something that demands some
> sort of support, even if imperfect or mildly hackish, whereas the
> latter is not a justification for hacks and bypassing proper standards
> processes.
>
> Rich

next prev parent reply	other threads:[~2013-08-08  0:18 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-07 16:50 Rich Felker
2013-08-08  0:18 ` Roy [this message]
2013-08-08  2:11   ` Rich Felker
2013-08-08  3:48     ` Roy
2013-08-08  3:53     ` Szabolcs Nagy
2013-08-08  4:30       ` Rich Felker
2013-08-08  4:50         ` Szabolcs Nagy
2013-08-08  5:31           ` Rich Felker
2013-08-08  7:19             ` Rich Felker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=op.w1g1tjtzdyj81a@monster.itedn32a.localdomain \
    --to=roytam@gmail.com \
    --cc=musl@lists.openwall.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).