From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3846
Path: news.gmane.org!not-for-mail
From: Rich Felker <dalias@aerifal.cx>
Newsgroups: gmane.linux.lib.musl.general
Subject: Status of Big5 and extensions
Date: Wed, 7 Aug 2013 12:50:44 -0400
Message-ID: <20130807165044.GA14867@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1375894259 16696 80.91.229.3 (7 Aug 2013 16:50:59 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Wed, 7 Aug 2013 16:50:59 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-3850-gllmg-musl=m.gmane.org@lists.openwall.com Wed Aug 07 18:51:02 2013
Return-path: <musl-return-3850-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-3850-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1V76wu-0006pa-QQ
	for gllmg-musl@plane.gmane.org; Wed, 07 Aug 2013 18:51:00 +0200
Original-Received: (qmail 28574 invoked by uid 550); 7 Aug 2013 16:50:58 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 28563 invoked from network); 7 Aug 2013 16:50:58 -0000
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.linux.lib.musl.general:3846
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/3846>

OK, so after a lot of research and discussion, I'm about to commit the
first part of Big5 support in iconv. The plain "Big5" charset name is
going to be the maximal set everybody agrees on, which, as far as I
can tell, is just CP950. (Actually even IBM's Big5 variant in ICU
differs from CP950 in a few places, but it's just wrong. The one
ideograph where it differs conflicts with Unihan.txt, which is
authoritative on which Unicode characters encode which character
identities from historical CJK charsets.)

As for extensions, my understanding of HKSCS has reached the point
where I feel we can add it (charset name "Big5-HKSCS"), based on the
2008 government publication, which has a few new characters beyond the
older versions found in most software. (Thank you nsz for helping dig
up all the files and researching how they differ!) However, there are
a few technical difficulties in implementing it. The Unicode
codepoints span a 17-bit range rather than just a 16-bit range, so we
need an efficient way of doing the mappings. What's worse, several
HKSCS codepoints map to Latin characters with multiple combining marks
which have no precomposed representation. Supporting these will
require extending iconv to emit two or more output characters for a
single unit of input, which is mildly error-prone with the current
design, so I may hold off on HKSCS support until I overhaul some of
the core logic of the iconv function.
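
To make the two HKSCS problems concrete, here is a rough sketch in C.
It is not the planned implementation, just one possible shape. The
names lookup17, emit, low16, plane2 and pairs are made up for
illustration, and the contents of low16/plane2 are sample values, not
real table data. The first half assumes every mapped codepoint lies
either in the BMP or in plane 2, so a single extra bit per entry on
top of a 16-bit table is enough. The second half shows why multiple
output characters are awkward: the output space has to be checked for
the whole sequence before anything is written. The four entries in
pairs are, to the best of my knowledge, the HKSCS codes that decode to
a letter plus a combining mark, but they should be checked against the
official tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample data, NOT real HKSCS table contents.
 * Low 16 bits of each mapped codepoint, indexed by table position. */
static const uint16_t low16[] = { 0x0021, 0x4E00, 0xA6D6 };
/* One extra bit per entry, set when the codepoint is in plane 2
 * (assumption: targets are only in the BMP or in U+20000..U+2FFFF). */
static const uint8_t plane2[] = { 0x05 };	/* entries 0 and 2 */

/* Rebuild the full codepoint from the 16-bit value plus the extra bit. */
static uint32_t lookup17(size_t i)
{
	uint32_t hi = (plane2[i / 8] >> (i % 8)) & 1;
	return (hi << 17) | low16[i];
}

/* Codes believed to decode to a base letter plus a combining mark
 * (verify against the official HKSCS-2008 tables before relying on). */
static const struct { uint16_t code; uint16_t seq[2]; } pairs[] = {
	{ 0x8862, { 0x00CA, 0x0304 } },
	{ 0x8864, { 0x00CA, 0x030C } },
	{ 0x88A3, { 0x00EA, 0x0304 } },
	{ 0x88A5, { 0x00EA, 0x030C } },
};

/* Write the codepoint(s) for code into out, which has room for `room`
 * entries. `single` is the codepoint to use when code is not in pairs.
 * Returns the number written, or 0 if the whole result does not fit,
 * in which case nothing is written and the input can be left
 * unconsumed by the caller. */
static size_t emit(uint16_t code, uint32_t single, uint32_t *out, size_t room)
{
	for (size_t i = 0; i < sizeof pairs / sizeof pairs[0]; i++) {
		if (pairs[i].code != code) continue;
		if (room < 2) return 0;
		out[0] = pairs[i].seq[0];
		out[1] = pairs[i].seq[1];
		return 2;
	}
	if (room < 1) return 0;
	out[0] = single;
	return 1;
}
```

The all-or-nothing return from emit is the part that matters: a
converter that writes the base letter and then runs out of room for
the combining mark would leave the caller with half a character.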

Now, the hard part: Taiwan extensions. I appreciate all the help from
Roy, but I'm still not at the point of having anything close to ready
to add. The UAO extension set has at least 290 mappings to
codepoints in the Unicode Private Use Areas (PUA), which makes it
unsuitable for inclusion as-is. This may be a resolvable issue if
these 290 characters all exist in Unicode and could be remapped, but I
do not feel any of us are qualified to determine this. So UAO is
probably still a long way away from being able to be adopted into
iconv, unless we have authoritative data on the identities of these
characters in the form of something from an official standards body or
at the very least multiple major vendors (enough to ensure that any
future standardization would be consistent and non-controversial).
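
For anyone who wants to audit a candidate mapping table for this
problem, a check along the following lines should do. It is only a
sketch: is_pua and count_pua are names made up here, and the table is
assumed to be a plain array of Unicode codepoints. The ranges are the
three Private Use Areas defined by Unicode (the BMP PUA plus planes 15
and 16, excluding the two noncharacters at the end of each plane).

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if c is in one of the Unicode Private Use Areas, else 0. */
static int is_pua(uint32_t c)
{
	return (c >= 0xE000 && c <= 0xF8FF)		/* BMP PUA */
	    || (c >= 0xF0000 && c <= 0xFFFFD)		/* plane 15 */
	    || (c >= 0x100000 && c <= 0x10FFFD);	/* plane 16 */
}

/* Count how many of the n codepoints in map fall in a PUA. */
static size_t count_pua(const uint32_t *map, size_t n)
{
	size_t k = 0;
	for (size_t i = 0; i < n; i++)
		k += is_pua(map[i]);
	return k;
}
```

A nonzero count for an extension table means it cannot be included
as-is without first settling what those characters really are.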

If there are other supersets of CP950 (possibly subsets of UAO) which
would be useful for supporting Taiwanese users, I would very much like
to understand their status and whether it's feasible to include them
in iconv at this point. Perhaps the most important thing to know is
what practical difficulties exist for Taiwanese users limited to
CP950, and which extension charsets could solve them. I feel that
issues like "I cannot write my own name" are
on a completely different level than "I can't mix Japanese text in
with my legacy-encoded data [when I really should be using Unicode for
this anyway]". The former sort of issue is something that demands some
sort of support, even if imperfect or mildly hackish, whereas the
latter is not a justification for hacks and bypassing proper standards
processes.

Rich