From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Request for help/info for Korean iconv task
Date: Fri, 2 Aug 2013 16:31:58 -0400
Message-ID: <20130802203157.GA14682@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

Hi all,

One of the big goals for the next release cycle is getting Korean
legacy encoding support into iconv. On quick inspection, I see
KS-C-5601 and KS-X-1001, the latter of which seems to be a subset of
the former, as the base character sets involved. These appear to be
much like JIS0208 for Japanese: not directly usable, since they
overlap with ASCII, and requiring an encoding that maps them to usable
byte values.
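
To make "maps them to usable byte values" concrete, here is a minimal
sketch of the usual EUC-style mapping for a 94x94 character set: each
of the row and cell numbers (1..94) is offset by 0xA0, so both bytes
land in 0xA1..0xFE and cannot collide with ASCII. The function names
are hypothetical and this is not musl code; it only illustrates the
byte arithmetic and how a byte pair would index a flat 94*94 table.

```c
/* Hypothetical helper: turn a 94x94 row/cell pair (each 1..94) into
 * the two EUC bytes. Returns 0 on success, -1 if out of range. */
static int rowcell_to_euc(unsigned row, unsigned cell, unsigned char out[2])
{
	if (row < 1 || row > 94 || cell < 1 || cell > 94) return -1;
	out[0] = 0xA0 + row;   /* lead byte, 0xA1..0xFE */
	out[1] = 0xA0 + cell;  /* trail byte, 0xA1..0xFE */
	return 0;
}

/* Hypothetical inverse: map an EUC byte pair to an index into a flat
 * 94*94 table, or -1 if either byte is outside 0xA1..0xFE. */
static int euc_to_index(unsigned char b1, unsigned char b2)
{
	if (b1 < 0xA1 || b1 > 0xFE || b2 < 0xA1 || b2 > 0xFE) return -1;
	return (b1 - 0xA1) * 94 + (b2 - 0xA1);
}
```

For example, row 16, cell 1 becomes the byte pair 0xB0 0xA1, which
corresponds to table index 15*94.
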
Further, it seems all the encodings in real-world use (covering Unix,
Windows, Mac) are based on the EUC scheme, which is the least insane
of the legacy DBCS designs. (An ISO-2022 based form is also possibly
used in email and on IRC; supporting this would be dependent on
stateful iconv, which is a separate agenda item.)

There are several issues I could use some help getting the full story
on:

1. Despite the big ones all being EUC-based, there seem to be several
   variants: EUC-KR, Windows-949, and maybe others. It's not
   immediately clear to me whether they differ in significant ways we
   would need to support.

2. Long runs of the character table are Hangul syllables (obviously),
   but it's unclear to me whether they are sufficiently organized in
   patterns that knowing the pattern would enable us to elide large
   enough parts of the table to be worthwhile.

3. There's some "Johab" encoding which may or may not be important,
   and I'm not sure how it's related to the others.

4. Are there perhaps any stats on overall usage of different charsets
   on the internet, on which we could base some judgements on
   relevance?

Here are the Unicode tables:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT
etc.

This also seems like a useful document but I have not had time to make
sense of it all yet:

http://stason.org/TULARC/languages/korean/8-What-are-KS-X-1001-KS-C-5601-and-other-Hangul-codes.html#.UfwXr6omNmk

Rich
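
For reference on item 2, the Unicode side of the mapping is fully
arithmetic: the Hangul syllable block U+AC00..U+D7A3 is laid out as
19 leading consonants x 21 vowels x 28 trailing slots (including "no
trailing consonant"). The sketch below shows only that arithmetic; the
function and macro names are hypothetical, and whether the legacy
tables line up with this ordering well enough to elide parts of them
is exactly the open question.

```c
/* Unicode Hangul syllable block: 19 * 21 * 28 = 11172 code points
 * starting at U+AC00. Names below are illustrative only. */
#define HANGUL_BASE 0xAC00
#define N_LEAD  19
#define N_VOWEL 21
#define N_TAIL  28  /* includes the "no trailing consonant" slot */

/* Compose a syllable from zero-based lead/vowel/tail indices. */
static unsigned hangul_compose(unsigned l, unsigned v, unsigned t)
{
	return HANGUL_BASE + (l * N_VOWEL + v) * N_TAIL + t;
}

/* Split a syllable code point into its three indices.
 * Returns 0 on success, -1 if wc is outside the syllable block. */
static int hangul_decompose(unsigned wc, unsigned *l, unsigned *v, unsigned *t)
{
	if (wc < HANGUL_BASE || wc >= HANGUL_BASE + N_LEAD * N_VOWEL * N_TAIL)
		return -1;
	unsigned s = wc - HANGUL_BASE;
	*l = s / (N_VOWEL * N_TAIL);
	*v = s / N_TAIL % N_VOWEL;
	*t = s % N_TAIL;
	return 0;
}
```

For example, U+D55C decomposes to lead index 18, vowel index 0, tail
index 4, and composing those indices gives U+D55C back.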