From: Rich Felker
To: musl@lists.openwall.com
Subject: Re: Re: iconv Korean and Traditional Chinese research so far
Date: Mon, 5 Aug 2013 13:31:45 -0400
Message-ID: <20130805173144.GL221@brightrain.aerifal.cx>
In-Reply-To: <20130805154344.GJ221@brightrain.aerifal.cx>
References: <20130804165152.GA32076@brightrain.aerifal.cx> <20130805154344.GJ221@brightrain.aerifal.cx>

On Mon, Aug 05, 2013 at 11:43:45AM -0400, Rich Felker wrote:
> > In EUC-KR (MS-CP949), there is Hanja characters (i.e. Kanji
> > characters in Japanese) and Japanese Katakana/Hiragana besides of
> > Hangul characters.
>
> Yes, I'm aware of these.
> However, it looks to me like the only
> characters outside the standard 94x94 grid zone are Hangul syllables,
> and they appear in codepoint order. If so, even if there's not a good
> pattern to where they're located, merely knowing that the ones that
> are missing from the 94x94 grid are placed in order in the expanded
> space is sufficient to perform algorithmic (albeit inefficient)
> conversion. Does this sound correct?

I've verified that this is correct and committed an implementation of
Korean based on this principle, which I basically copied from my
current implementation of GB18030's support for arbitrary Unicode
codepoints. It has not been heavily tested, but I did test it casually
with all the important boundary values and it seems correct. Tests
should probably be added to the test suite.

Rich
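
For illustration, the counting principle described above can be
sketched roughly as follows. This is a toy sketch under stated
assumptions, not the committed code: the three-entry grid[] table is
hypothetical and stands in for the real set of Hangul syllables
already present in the 94x94 grid, and the names BASE, ext_to_cp and
cp_to_ext are made up for the example. The only idea it shows is that
the n-th slot of the extended area holds the n-th codepoint, in
codepoint order, that is missing from the grid.

```c
#include <stddef.h>

#define BASE 0xAC00u  /* first Hangul syllable codepoint */

/* Hypothetical stand-in for "syllables already in the 94x94 grid". */
static const unsigned grid[] = { 0xAC00, 0xAC01, 0xAC04 };
#define NGRID (sizeof grid / sizeof grid[0])

/* Extended-area index n (0-based) to codepoint: start from BASE+n and
 * keep advancing by the number of grid entries found in the part of
 * the range that has not been scanned yet. */
unsigned ext_to_cp(unsigned n)
{
	unsigned c = BASE + n;  /* candidate if nothing had to be skipped */
	unsigned d = BASE;      /* start of the not-yet-scanned range */
	while (d <= c) {
		unsigned k = 0;
		for (size_t i = 0; i < NGRID; i++)
			if (grid[i] >= d && grid[i] <= c) k++;
		d = c + 1;
		c += k;
	}
	return c;
}

/* Codepoint to extended-area index, or -1 if the codepoint is below
 * BASE or already lives in the grid. */
long cp_to_ext(unsigned cp)
{
	unsigned skipped = 0;
	if (cp < BASE) return -1;
	for (size_t i = 0; i < NGRID; i++) {
		if (grid[i] == cp) return -1;
		if (grid[i] < cp) skipped++;
	}
	return (long)(cp - BASE) - (long)skipped;
}
```

With this toy table, ext_to_cp(0) gives 0xAC02 and ext_to_cp(2) gives
0xAC05, since 0xAC00, 0xAC01 and 0xAC04 are skipped. Each lookup
rescans the whole table, which is the "inefficient" part; it trades
time for not having to store a second table for the extended area.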