From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3800
Path: news.gmane.org!not-for-mail
From: Rich Felker <dalias@aerifal.cx>
Newsgroups: gmane.linux.lib.musl.general
Subject: Request for help/info for Korean iconv task
Date: Fri, 2 Aug 2013 16:31:58 -0400
Message-ID: <20130802203157.GA14682@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1375475530 18984 80.91.229.3 (2 Aug 2013 20:32:10 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 2 Aug 2013 20:32:10 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-3804-gllmg-musl=m.gmane.org@lists.openwall.com Fri Aug 02 22:32:14 2013
Return-path: <musl-return-3804-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-3804-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1V5M1F-0005WE-0F
	for gllmg-musl@plane.gmane.org; Fri, 02 Aug 2013 22:32:13 +0200
Original-Received: (qmail 1637 invoked by uid 550); 2 Aug 2013 20:32:11 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 1623 invoked from network); 2 Aug 2013 20:32:10 -0000
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.linux.lib.musl.general:3800
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/3800>

Hi all,

One of the big goals for the next release cycle is getting Korean
legacy encoding support into iconv. On quick inspection, I see
KS-C-5601 and KS-X-1001, the latter of which seems to be a subset of
the former, as the base character sets involved. These appear to be
much like JIS0208 for Japanese: not directly usable, since they
overlap with ASCII, and requiring an encoding that maps them to usable
byte values. Further, it seems all the encodings in real-world use
(covering Unix, Windows, Mac) are based on the EUC scheme, which is
the least insane of the legacy DBCS designs. (An ISO-2022 based form
is also possibly used in email and on IRC; supporting this would be
dependent on stateful iconv, which is a separate agenda item.)
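
As a rough sketch of what the EUC mapping amounts to, assuming the
plain form where both bytes of a double-byte character lie in
0xA1..0xFE (a 94x94 grid shifted into the high half so it no longer
collides with ASCII): the function name euc_index is hypothetical and
this is not musl code. Extended variants that use additional byte
ranges are not covered.

```c
/* Hypothetical helper for illustration, not musl's implementation.
 * Map an EUC two-byte sequence to a linear index into a 94x94 table.
 * Assumes the plain EUC form: both bytes in 0xA1..0xFE.
 * Returns -1 if either byte falls outside that range. */
int euc_index(unsigned char b1, unsigned char b2)
{
	if (b1 < 0xA1 || b1 > 0xFE || b2 < 0xA1 || b2 > 0xFE)
		return -1;
	/* Zero-based row times row width, plus zero-based column. */
	return (b1 - 0xA1) * 94 + (b2 - 0xA1);
}
```

For example, the byte pair 0xB0 0xA1 gives index 15*94 + 0 = 1410,
while any pair whose first byte is an ASCII value such as 0x41 is
rejected with -1.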

There are several issues I could use some help getting the full story
on:

1. Despite the big ones all being EUC-based, there seem to be several
variants: EUC-KR, Windows-949, and maybe others. It's not immediately
clear to me whether they differ in significant ways we would need to
support.

2. Long runs of the character table are Hangul syllables (obviously),
but it's unclear to me whether they are sufficiently organized in
patterns that knowing the pattern would enable us to elide large
enough parts of the table to be worthwhile.

3. There's some "Johab" encoding which may or may not be important,
and I'm not sure how it's related to the others.

4. Are there perhaps any stats on overall usage of different charsets
on the internet, on which we could base some judgements on relevance?
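
For reference on item 2, the Unicode side of the mapping is fully
regular: the precomposed Hangul syllables occupy U+AC00..U+D7A3 and
are ordered by leading consonant (19), vowel (21), and trailing
consonant (28, including "none"). Whether the legacy tables follow a
pattern regular enough to exploit is exactly the open question, so the
sketch below only shows the Unicode arithmetic. The names hangul_split
and hangul_join are hypothetical.

```c
/* Hypothetical helpers for illustration only.
 * Split a precomposed Unicode Hangul syllable (U+AC00..U+D7A3) into
 * leading consonant, vowel, and trailing consonant indices.
 * Returns 0 on success, -1 if c is outside the syllable block. */
int hangul_split(unsigned c, int *lead, int *vowel, int *tail)
{
	if (c < 0xAC00 || c > 0xD7A3)
		return -1;
	c -= 0xAC00;
	*lead = c / (21 * 28);   /* 0..18 */
	*vowel = c / 28 % 21;    /* 0..20 */
	*tail = c % 28;          /* 0..27, where 0 means no trailing consonant */
	return 0;
}

/* Inverse of hangul_split; the caller must pass in-range indices. */
unsigned hangul_join(int lead, int vowel, int tail)
{
	return 0xAC00 + (lead * 21 + vowel) * 28 + tail;
}
```

With this layout, U+D55C splits into lead 18, vowel 0, tail 4, and
joining those indices gives U+D55C back.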

Here are the Unicode tables:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT
etc.

This also seems like a useful document but I have not had time to make
sense of it all yet:

http://stason.org/TULARC/languages/korean/8-What-are-KS-X-1001-KS-C-5601-and-other-Hangul-codes.html#.UfwXr6omNmk

Rich