From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 01:00:29 -0400
Message-ID: <20130805050028.GD221@brightrain.aerifal.cx>
References: <20130804165152.GA32076@brightrain.aerifal.cx>
In-Reply-To: <20130804165152.GA32076@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Sun, Aug 04, 2013 at 12:51:52PM -0400, Rich Felker wrote:
> Both of these have various minor extensions, but the main extensions
> of any relevance seem to be:
>
> Korean:
> CP949
> Lead byte range is extended to 81-FD (125)
> Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
> 44500 bytes table space
>
> Traditional Chinese:
> HKSCS (CP951)
> Lead byte range is extended to 88-FE (119)
> 1651 characters outside BMP
> 37366 bytes table space for 16-bit mapping table, plus extra mapping
> needed for characters outside BMP
>
> The big remaining questions are:
>
> 1. How important are these extensions? I would guess the answer is
> "fairly important", especially for HKSCS where I believe the
> additional characters are needed for encoding Cantonese words, but
> it's less clear to me whether the Korean extensions are useful (they
> seem to mainly be for the sake of completeness representing most/all
> possible theoretical syllables that don't actually occur in words, but
> this may be a naive misunderstanding on my part).

For what it's worth, there is no IANA charset registration for any
supplement to Korean. See the table here:

http://www.iana.org/assignments/character-sets/character-sets.xhtml

The only entries for Korean are ISO-2022-KR and EUC-KR. Big5-HKSCS,
however, is registered. This matches my intuition that, of the two,
HKSCS would be more important to real-world usage than the Korean
extensions.

If we were to omit CP949 and just go with KS X 1001, but include
HKSCS, the total size (not counting the minimal amount of code needed)
would be 17484+37366 = 54850. With both supported, it would be
44500+37366 = 81866. With just KS X 1001 and base Big5, it would be
17484+27946 = 45430.

Given that HKSCS is a standard, registered MIME charset, that its cost
is only about 10k, and that it seems necessary for real-world usage in
Hong Kong, I think it's pretty obvious that we should support it. So I
think the question we're left with is whether the CP949 (MS encoding)
extension for Korean is important to support. The cost is roughly 27k
(44500-17484 = 27016). I'm going to keep doing research to see if
identifying the characters added in it sheds any light on whether
there are important additions. Obviously I would like to be able to
exclude it, but I don't want this decision to be made unfairly based
on my bias when it comes to bloat. :)

Rich
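
The size figures in this message are consistent with a dense table of
16-bit entries indexed by (lead byte, tail byte), i.e. leads x tails x 2
bytes. The Python sketch below only re-checks that arithmetic. The lead
and tail counts for CP949 and the lead count for HKSCS come from the
quoted text; the counts used for base KS X 1001 (93 x 94) and for Big5
(89 leads, 157 tails) are assumptions, not stated in the message, chosen
because they reproduce the quoted sizes. The function and variable names
are illustrative only.

```python
# Illustrative check of the table-size arithmetic quoted in this thread.
# Assumption: each table is a dense array of 16-bit entries, one per
# (lead byte, tail byte) pair.

def table_bytes(leads, tails, entry_size=2):
    """Size in bytes of a dense leads x tails table of fixed-size entries."""
    return leads * tails * entry_size

# Counts taken from the quoted message.
CP949_LEADS = 0xFD - 0x81 + 1      # 81-FD -> 125
CP949_TAILS = 26 + 26 + 126        # 41-5A, 61-7A, 81-FE -> 178
HKSCS_LEADS = 0xFE - 0x88 + 1      # 88-FE -> 119

# Assumed counts (not in the message; chosen to match its sizes).
KSX1001_LEADS = 93
KSX1001_TAILS = 94
BIG5_LEADS = 0xF9 - 0xA1 + 1       # A1-F9 -> 89
BIG5_TAILS = (0x7E - 0x40 + 1) + (0xFE - 0xA1 + 1)  # 40-7E, A1-FE -> 157

sizes = {
    "KS X 1001": table_bytes(KSX1001_LEADS, KSX1001_TAILS),
    "CP949": table_bytes(CP949_LEADS, CP949_TAILS),
    "Big5": table_bytes(BIG5_LEADS, BIG5_TAILS),
    "HKSCS": table_bytes(HKSCS_LEADS, BIG5_TAILS),
}

# The three configurations compared in the message.
configs = {
    "KS X 1001 + Big5": sizes["KS X 1001"] + sizes["Big5"],
    "KS X 1001 + HKSCS": sizes["KS X 1001"] + sizes["HKSCS"],
    "CP949 + HKSCS": sizes["CP949"] + sizes["HKSCS"],
}

# Incremental cost of each extension over its base table.
hkscs_delta = sizes["HKSCS"] - sizes["Big5"]
cp949_delta = sizes["CP949"] - sizes["KS X 1001"]

if __name__ == "__main__":
    for name, total in configs.items():
        print(f"{name}: {total} bytes")
    print(f"HKSCS over Big5: {hkscs_delta} bytes")
    print(f"CP949 over KS X 1001: {cp949_delta} bytes")
```

By this arithmetic, HKSCS adds 9420 bytes over base Big5 and CP949 adds
27016 bytes over KS X 1001. None of these figures include the extra
mapping needed for the HKSCS characters outside the BMP.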