From: Rich Felker
To: musl@lists.openwall.com
Subject: Re: Re: Re: Big5 "mostly" complete
Date: Mon, 26 Aug 2013 21:53:49 -0400
Message-ID: <20130827015349.GG20515@brightrain.aerifal.cx>
References: <20130817205757.GA32462@brightrain.aerifal.cx> <20130818073229.GE20515@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com

On Sun, Aug 18, 2013 at 07:19:57PM +0800, Roy wrote:
> On Sun, 18 Aug 2013 15:32:29 +0800, Rich Felker wrote:
>
> >On Sun, Aug 18, 2013 at 12:20:47PM +0800, Roy wrote:
> >>Both Big5-UAO and Big5-HKSCS are needed for those Taiwan people and
> >>Hong Kong people.
> >>For Big5-UAO, there is some commonly used dingbats(for example "♡"
> >>mark) and numeric representations(for example "①") are in Big5-UAO
> >>but not in CP950.
> >>and Big5-UAO is still being used not only in ptt.cc telnet BBS, but
> >>also in text data files(file lists/cue sheets) because of
> >>not-supporting UTF-8 in applications(for example, Perl File-system
> >>I/O in windows, CD-Rippers).
> >>for Big5-HKSCS, it use used for storing commonly used Cantonese
> >>ideographs (for example, "𨋢" means "lift" in Cantonese) in Hong
> >>Kong.
> >
> >HKSCS is supported as of yesterday's commit. I'm aware that it's
> >needed for representing Cantonese language in Big5, and that it's
> >widely used on the web.
> >
> >What I'm not clear on is the necessity of UAO. Keep in mind that iconv
> >is an API for information interchange: things like interpreting web
> >content, email, old text files, etc. The fact that UAO exists is not
> >alone reason to support it; it has to actually have usefulness in
> >situations where the iconv interface should be used. If you want to
> >see it included, this is what you need to convince us of:
> >
> >- That it's in widespread use in large volumes of existing data (on
> >  the web, text files, etc.) or data that is being newly generated
> >  (e.g. as a default encoding of popular mail software).
>
> People are told *NOT* to publish file with Big5-UAO to the web(or
> say, people, even the creator of UAO, appeal to people that not to
> publish file with Big5-UAO to the web), but still there are some
> that's in archive format.(Like I said before, for example cue-sheet
> file of CD-ROM image, etc.)
> But for local data processing, UAO does facilitate file managing to
> windows users.

Based on this, I think:

(1) It's reasonable to omit UAO for now, and

(2) Support for iconv to load user-defined character mappings would be
    a worthwhile feature to work on post-1.0.

My reasoning is that the goal of iconv in musl, at least for the
built-in character set conversions, is to facilitate information
interchange, particularly reading of data that may be received in
email, as documents published on the web, via IRC or IM protocols,
etc. An encoding whose creators specifically request that it NOT be
used for publishing/interchange is well outside this scope.

I agree with your examples (CD-ROM cue sheets, archived text files,
that telnet BBS, etc.) that there is a need by some users to
process/import data encoded in UAO, but most of these usages do not
seem to require general applications, treating charsets in an
abstract, MIME-style manner, to be able to handle it. For many of the
examples, a command-line conversion utility (BTW, there are ones much
more powerful than iconv out there) would be the logical choice. For
the BBS, my understanding is that most of its users are using special
telnet/terminal apps with the conversion built-in.

> >- That it's necessary to represent linguistic content in languages
> >  used in Taiwan, not just as a substitute for Unicode to represent
> >  foreign languages.
>
> It does, some Chinese ideographs are used as part of name, but not
> in CP950 mapping like "喆" and "堃".

How do these users send email or enter their names in web-based apps?
My guess would be that the email clients switch to UTF-8 when
encountering a character they can't encode in Big5, and that,
nowadays, most web apps are built on CMS that are Unicode-based. Is
this correct?

> >- That failure to support it would put musl's iconv in a worse
> >  position of compatibility than other iconv implementations or
> >  software-specific (e.g. in-browser) character set conversions.
>
> Since people made Big5-UAO patch for libiconv and glibc(gconv)
> unofficially to meet their uses, if musl libc have an optional
> Big5-UAO mapping will be an advantage to Taiwan people.

*nod* For what it's worth, how do those patches handle it?
Do they add a new "Big5-UAO" charset name to iconv, or do they modify
the existing Big5 to treat it as UAO?

My feeling for now is to increase the priority of adding custom local
charmap files to iconv after musl 1.0 is released. My main reason is
that "intended for information interchange" vs "intended only for
local use" seems to be the best guideline for whether an encoding is
appropriate to include built-in.

Rich