From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3849
Path: news.gmane.org!not-for-mail
From: Roy <roytam@gmail.com>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Status of Big5 and extensions
Date: Thu, 08 Aug 2013 08:18:45 +0800
Message-ID: <op.w1g1tjtzdyj81a@monster.itedn32a.localdomain>
References: <20130807165044.GA14867@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
Content-Transfer-Encoding: Quoted-Printable
X-Trace: ger.gmane.org 1375921147 15993 80.91.229.3 (8 Aug 2013 00:19:07 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Thu, 8 Aug 2013 00:19:07 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-3853-gllmg-musl=m.gmane.org@lists.openwall.com Thu Aug 08 02:19:10 2013
Return-path: <musl-return-3853-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-3853-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1V7DwY-0003qC-Ns
	for gllmg-musl@plane.gmane.org; Thu, 08 Aug 2013 02:19:06 +0200
Original-Received: (qmail 30117 invoked by uid 550); 8 Aug 2013 00:19:04 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 30103 invoked from network); 8 Aug 2013 00:19:04 -0000
X-Injected-Via-Gmane: http://gmane.org/
Original-Lines: 79
Original-X-Complaints-To: usenet@ger.gmane.org
X-Gmane-NNTP-Posting-Host: 203186096008.static.ctinets.com
User-Agent: Opera Mail/11.64 (Win32)
Xref: news.gmane.org gmane.linux.lib.musl.general:3849
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/3849>

On Thu, 08 Aug 2013 00:50:44 +0800, Rich Felker <dalias@aerifal.cx> wrote:

> OK, so after a lot of research and discussion, I'm about to commit the
> first part of Big5 support in iconv. The plain "Big5" charset name is
> going to be the maximal set everybody agrees on, which, as far as I
> can tell, is just CP950. (Actually even IBM's Big5 variant in ICU
> differs from CP950 in a few places, but it's just wrong. The one
> ideograph where it differs conflicts with Unihan.txt, which is
> authoritative on which Unicode characters encode which character
> identities from historical CJK charsets.)
>
> As for extensions, my understanding of HKSCS is to a point now where I
> feel we can add it (charset name "Big5-HKSCS"), based on the 2008
> government publication which has a few new characters beyond the old
> versions in most software. (Thank you nsz for helping dig up all the
> files and researching how they differ!) However there are a few
> technical difficulties to implementation: the Unicode codepoints span
> a 17-bit range rather than just a 16-bit range, so we need an
> efficient way of doing the mappings. What's worse, several HKSCS
> codepoints map to Latin characters with multiple combining marks which
> have no precomposed representation. Supporting these will require
> extending iconv to be able to output two or more characters of output
> for each unit of input, which is mildly error-prone with the current
> design, so I may hold off on HKSCS support until I overhaul some of
> the core logic of the iconv function.
>
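
A minimal sketch of the "more than one character of output per unit of
input" case described above, assuming a small side table consulted before
the normal one-to-one mapping. The struct, table, and function names are
hypothetical and are not musl's iconv internals; the four entries are the
HKSCS codes that, to my knowledge, expand to a base letter plus a combining
mark, and should be checked against the HKSCS-2008 data before use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical side table: one Big5-HKSCS code -> two Unicode code points. */
struct multi_map {
    uint16_t big5;
    uint32_t cp[2];
};

/* Sample entries (assumption: verify against the official HKSCS-2008 tables). */
static const struct multi_map hkscs_multi[] = {
    { 0x8862, { 0x00CA, 0x0304 } },
    { 0x8864, { 0x00CA, 0x030C } },
    { 0x88A3, { 0x00EA, 0x0304 } },
    { 0x88A5, { 0x00EA, 0x030C } },
};

/*
 * Look up a code in the side table.
 * Returns 2 and fills out[0..1] on a hit, 0 if the code is not in the
 * table (the caller falls back to the one-to-one mapping), or (size_t)-1
 * if the output buffer cannot hold both code points.
 */
static size_t hkscs_decode_multi(uint16_t big5, uint32_t *out, size_t cap)
{
    size_t n = sizeof hkscs_multi / sizeof hkscs_multi[0];
    for (size_t i = 0; i < n; i++) {
        if (hkscs_multi[i].big5 != big5)
            continue;
        if (cap < 2)
            return (size_t)-1;  /* not enough room: report, write nothing */
        out[0] = hkscs_multi[i].cp[0];
        out[1] = hkscs_multi[i].cp[1];
        return 2;
    }
    return 0;
}
```

The point of checking the capacity before writing anything is that a
converter must either emit the whole sequence or none of it; a half-written
pair is the error-prone case.
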
> Now, the hard part: Taiwan extensions. I appreciate all the help from
> Roy, but I'm still not to the point of having anything nearly ready to
> be added. The UAO extension set has at least 290 mappings to
> codepoints in the Unicode Private Use Areas (PUA), which makes it
> unsuitable for inclusion as-is. This may be a resolvable issue if
> these 290 characters all exist in Unicode and could be remapped, but I
> do not feel any of us are qualified to determine this. So UAO is
> probably still a long way away from being able to be adopted into
> iconv, unless we have authoritative data on the identities of these
> characters in the form of something from an official standards body or
> at the very least multiple major vendors (enough to ensure that any
> future standardization would be consistent and non-controversial).
>

I asked one of the creators of UAO at
http://forum.moztw.org/viewtopic.php?f=10&t=40174
and got a reply describing those PUA characters by their Big5 code ranges:
0xFA40 - 0xFA63: Reserved for user-defined characters
0xC8A5 - 0xC8B0: For Big5-2003 compliance
Others: "Art characters" of the ChinaSea character set (csw 1.0, i.e.
cswsmin.tte), which are not mapped to Unicode and which use codepoints not
occupied by HKSCS.
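
To make the three groups above concrete, here is a rough sketch that
classifies a Big5 code already known to map into the PUA under UAO. The enum
and function names are hypothetical (nothing here comes from UAO or musl);
only the two ranges are taken from the reply quoted above.

```c
#include <stdint.h>

/* Hypothetical categories for UAO codes that map to the Unicode PUA. */
enum uao_pua_class {
    UAO_PUA_USER_DEFINED,  /* 0xFA40 - 0xFA63: reserved for user-defined characters */
    UAO_PUA_BIG5_2003,     /* 0xC8A5 - 0xC8B0: Big5-2003 compliance */
    UAO_PUA_OTHER          /* everything else: ChinaSea "art characters" */
};

/*
 * Classify a Big5 code. Assumes the caller has already established that
 * the code maps to a PUA codepoint in UAO; this function only tells the
 * three groups apart by range.
 */
static enum uao_pua_class uao_classify_pua(uint16_t big5)
{
    if (big5 >= 0xFA40 && big5 <= 0xFA63)
        return UAO_PUA_USER_DEFINED;
    if (big5 >= 0xC8A5 && big5 <= 0xC8B0)
        return UAO_PUA_BIG5_2003;
    return UAO_PUA_OTHER;
}
```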

> If there are other supersets of CP950 (possibly subsets of UAO) which
> would be useful for supporting Taiwanese users, I would very much like
> to understand the situation on them and whether it's feasible to
> include them in iconv at this point. Perhaps the most important thing
> to know at this point is what practical difficulties exist for
> Taiwanese users limited to CP950, and which extension charsets could
> solve this. I feel like issues like "I cannot write my own name" are
> on a completely different level than "I can't mix Japanese text in
> with my legacy-encoded data [when I really should be using Unicode for
> this anyway]". The former sort of issue is something that demands some
> sort of support, even if imperfect or mildly hackish, whereas the
> latter is not a justification for hacks and bypassing proper standards
> processes.
>
> Rich