mailing list of musl libc
From: Rich Felker <dalias@aerifal.cx>
To: musl@lists.openwall.com
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 01:00:29 -0400
Message-ID: <20130805050028.GD221@brightrain.aerifal.cx>
In-Reply-To: <20130804165152.GA32076@brightrain.aerifal.cx>

On Sun, Aug 04, 2013 at 12:51:52PM -0400, Rich Felker wrote:
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
> 
> Korean:
> CP949
> Lead byte range is extended to 81-FD (125)
> Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
> 44500 bytes table space
> 
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
> 
> The big remaining questions are:
> 
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).

For what it's worth, there is no IANA charset registration for any
supplement to Korean. See the table here:

http://www.iana.org/assignments/character-sets/character-sets.xhtml

The only entries for Korean are ISO-2022-KR and EUC-KR.

Big5-HKSCS however is registered. This matches my intuition that, of
the two, HKSCS would be more important to real-world usage than Korean
extensions.

If we were to omit CP949 and just go with KS X 1001, but include
HKSCS, the total table size (not counting the small amount of code
needed) would be 17484+37366 = 54850.

With both supported, it would be 44500+37366 = 81866.

With just KS X 1001 and base Big5, it would be 17484+27946 = 45430.
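
For anyone who wants to check these figures, here is a rough sketch that
rebuilds them from lead/tail byte ranges, assuming one 16-bit entry per
(lead, tail) cell. The CP949 ranges and the HKSCS lead range are the ones
quoted above. The Big5 tail ranges (40-7E, A1-FE) and the KS X 1001 and
base Big5 grids (93x94 and 89x157) are my assumptions; they are not stated
in this message, only chosen because they reproduce the quoted sizes. The
helper names are made up for illustration.

```python
ENTRY_BYTES = 2  # assumption: each table cell is one 16-bit value


def span(lo, hi):
    """Count the byte values in the inclusive range lo..hi."""
    return hi - lo + 1


def table_bytes(lead_ranges, tail_ranges):
    """Size in bytes of a dense lead x tail table of 16-bit entries."""
    leads = sum(span(lo, hi) for lo, hi in lead_ranges)
    tails = sum(span(lo, hi) for lo, hi in tail_ranges)
    return leads * tails * ENTRY_BYTES


# Ranges quoted in the message: lead 81-FD, tail 41-5A,61-7A,81-FE.
cp949 = table_bytes([(0x81, 0xFD)], [(0x41, 0x5A), (0x61, 0x7A), (0x81, 0xFE)])

# Assumed Big5 tail layout (not stated in the message).
BIG5_TAILS = [(0x40, 0x7E), (0xA1, 0xFE)]

# HKSCS lead range 88-FE is quoted in the message.
hkscs = table_bytes([(0x88, 0xFE)], BIG5_TAILS)

# Assumed grids for the unextended encodings.
ksx1001 = table_bytes([(0xA1, 0xFD)], [(0xA1, 0xFE)])  # 93 x 94
big5 = table_bytes([(0xA1, 0xF9)], BIG5_TAILS)         # 89 x 157

totals = {
    "KS X 1001 + HKSCS": ksx1001 + hkscs,
    "CP949 + HKSCS": cp949 + hkscs,
    "KS X 1001 + Big5": ksx1001 + big5,
}

for name, size in totals.items():
    print(f"{name}: {size} bytes")
```

This only accounts for the 16-bit mapping tables; the extra mapping for
HKSCS characters outside the BMP is not included.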

Given that HKSCS is a standard, registered MIME charset, that the cost
is only about 10k, and that it seems necessary for real-world usage in
Hong Kong, I think it's pretty obvious that we should support it. So
the question we're left with is whether the CP949 (MS encoding)
extension for Korean is important to support. The cost is roughly 27k
(44500-17484 = 27016).
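
The marginal cost of each extension is just the difference between the
extended and base table sizes given earlier in this message. A small
sketch, using only those four numbers (the dictionary and variable names
are illustrative):

```python
# Table sizes in bytes, as quoted in this message.
SIZES = {
    "KS X 1001": 17484,
    "CP949": 44500,
    "Big5": 27946,
    "HKSCS": 37366,
}

# Extra bytes needed to go from the base encoding to its extension.
hkscs_extra = SIZES["HKSCS"] - SIZES["Big5"]
cp949_extra = SIZES["CP949"] - SIZES["KS X 1001"]

print(f"HKSCS over base Big5: {hkscs_extra} bytes")
print(f"CP949 over KS X 1001: {cp949_extra} bytes")
```

As above, this leaves out the additional mapping needed for the HKSCS
characters outside the BMP.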

I'm going to keep doing research to see if identifying the characters
added in it sheds any light on whether there are important additions.
Obviously I would like to be able to exclude it but I don't want this
decision to be made unfairly based on my bias when it comes to bloat.
:)

Rich

