mailing list of musl libc
* Request for help/info for Korean iconv task
@ 2013-08-02 20:31 Rich Felker
From: Rich Felker @ 2013-08-02 20:31 UTC (permalink / raw)
  To: musl

Hi all,

One of the big goals for the next release cycle is getting Korean
legacy encoding support into iconv. On quick inspection, I see
KS-C-5601 and KS-X-1001, the latter of which seems to be a subset of
the former, as the base character sets involved. These appear to be
much like JIS0208 for Japanese: not directly usable, since they
overlap with ASCII, and requiring an encoding that maps them to usable
byte values. Further, it seems all the encodings in real-world use
(covering Unix, Windows, Mac) are based on the EUC scheme, which is
the least insane of the legacy DBCS designs. (An ISO-2022 based form
is also possibly used in email and on IRC; supporting this would be
dependent on stateful iconv, which is a separate agenda item.)
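
To make the EUC layout concrete, here is a minimal sketch. It assumes the
usual EUC arrangement (ASCII as single bytes below 0x80, and each position
of the 94x94 grid stored as two bytes in the range 0xA1..0xFE). The
function name is made up for illustration and is not existing musl code.

```c
/* Hypothetical helper: map an EUC-style byte pair to a zero-based
 * index into a 94x94 table. Assumes both bytes of a double-byte
 * character lie in 0xA1..0xFE, i.e. the grid position plus 0xA0.
 * Returns row*94+cell (0..8835), or -1 if the pair is out of range. */
static int euc_index(unsigned char hi, unsigned char lo)
{
	if (hi < 0xA1 || hi > 0xFE || lo < 0xA1 || lo > 0xFE)
		return -1;
	return (hi - 0xA1) * 94 + (lo - 0xA1);
}
```

With this, euc_index(0xB0, 0xA1) gives 1410, i.e. row 16, cell 1 in
one-based terms. Any variant that extends the byte ranges (as Windows-949
appears to do) would need extra cases beyond this.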

There are several issues I could use some help getting the full story
on:

1. Despite the big ones all being EUC-based, there seem to be several
variants: EUC-KR, Windows-949, and maybe others. It's not immediately
clear to me whether they differ in significant ways we would need to
support.

2. Long runs of the character table are Hangul syllables (obviously),
but it's unclear to me whether they follow patterns regular enough
that we could elide parts of the table large enough to be worthwhile.

3. There's also a "Johab" encoding, which may or may not be important,
and I'm not sure how it relates to the others.

4. Are there perhaps any stats on overall usage of different charsets
on the internet, on which we could base some judgements on relevance?
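
Regarding item 2, the Unicode side at least is fully regular: the 11172
precomposed Hangul syllables starting at U+AC00 are laid out arithmetically
as 19 leading consonants x 21 vowels x 28 trailing consonants (including
"none"). The sketch below shows only that arithmetic; the function names
are made up, and whether the legacy tables line up with this ordering well
enough to drop table entries is exactly the open question.

```c
#define HANGUL_BASE  0xAC00u
#define HANGUL_COUNT 11172u  /* 19 * 21 * 28 */

/* Split a precomposed Hangul syllable into zero-based indices for the
 * leading consonant (0..18), vowel (0..20) and trailing consonant
 * (0..27, where 0 means none). Returns 1 on success, 0 if c is not
 * in the Hangul syllables block. */
static int hangul_decompose(unsigned c, unsigned *l, unsigned *v, unsigned *t)
{
	unsigned s;
	if (c < HANGUL_BASE || c - HANGUL_BASE >= HANGUL_COUNT)
		return 0;
	s = c - HANGUL_BASE;
	*l = s / (21 * 28);
	*v = s / 28 % 21;
	*t = s % 28;
	return 1;
}

/* Inverse of hangul_decompose; the caller must pass in-range indices. */
static unsigned hangul_compose(unsigned l, unsigned v, unsigned t)
{
	return HANGUL_BASE + (l * 21 + v) * 28 + t;
}
```

If a legacy table lists its syllables in increasing Unicode order, one
option would be to store only which of the 11172 positions are present
rather than a full code point per entry, but that depends on facts about
the tables I have not verified.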

Here are the Unicode tables:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT
etc.
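
For anyone wanting to poke at these tables, a minimal sketch of a line
parser follows. It assumes each data line has the form "0xXXXX 0xXXXX
#comment" (legacy code, then Unicode code point, whitespace-separated),
which should be checked against the header of each file. The function
name is hypothetical.

```c
#include <stdio.h>

/* Parse one line of a mapping table, assumed to look like
 *   0x8141<TAB>0xAC02<TAB>#comment
 * Returns 1 and stores both values if the line carries a mapping,
 * 0 for blank lines, comment lines, and entries with no Unicode value. */
static int parse_map_line(const char *line, unsigned *legacy, unsigned *uni)
{
	/* %x accepts an optional 0x prefix and skips leading whitespace */
	return sscanf(line, "%x %x", legacy, uni) == 2;
}
```

Feeding each line from fgets() through this and collecting the pairs is
enough to check ordering and gaps in the Hangul ranges.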

This also seems like a useful document, but I have not had time to make
sense of it all yet:

http://stason.org/TULARC/languages/korean/8-What-are-KS-X-1001-KS-C-5601-and-other-Hangul-codes.html#.UfwXr6omNmk

Rich


