From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Help establishing wctype character classes
Date: Sun, 22 Apr 2012 21:44:14 -0400
Message-ID: <20120423014414.GK14673@brightrain.aerifal.cx>
References: <20120422204103.GJ14673@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

It seems glibc defines these via localedata/gen-unicode-ctype.c in the
following ways:

- Alphabetic: It has complex special-casing for some particular
  characters based on reported errors in Unicode, but basically it
  amounts to all of categories L*, Nl, Nd, and members of category So
  which have "LETTER" in their name.

- Blank: Tab and all of category Zs without <noBreak>.

- Space: The ASCII space class, plus all of Zs, Zl, and Zp without
  <noBreak>.

- Control: Anything with <control> as its name or category Zl/Zp.

- Graphic: Any non-control, non-space.

- Printable: Any non-control.

- Punctuation: Any non-alphanumeric graphic. They cite this as "the
  traditional POSIX definition of punctuation", so I'm inclined to
  think they have a good idea here.

Source:
http://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/gen-unicode-ctype.c

Note that wherever I've said "any", it's actually quantified only over
defined characters; thus any characters added to Unicode later than
the version glibc is sync'd with are reported as non-printable but
also non-control. This seems highly undesirable to me; it's the reason
"less" refuses to show new characters and instead prints a placeholder
for them.

I would rather have every valid _codepoint_ be either control or
printable, and all the non-space printable codepoints be graphic, but
then among the graphic codepoints, only define alphanumeric or
punctuation class for those codepoints assigned to characters in
Unicode. That is, the hierarchy would break down as:

1. All valid codepoints are either control or printable.
2. All printable codepoints are either ASCII space or graphic.
3. All graphic codepoints are either assigned or unassigned.
4. All assigned graphic codepoints (graphic _characters_) are either
   alphanumeric or punctuation.

It seems the only arbitrary decision left for us to make is how to
divide the graphic characters between alphanumeric and punctuation.
And this can be done by an explicit definition for either one, in
terms of which the other will be implicitly defined.

Here's a possible definition for alphanumeric:

- All characters with Unicode Alphabetic property (includes L*, Nl,
  and special cases (Other_Alphabetic) defined in PropList.txt).

- All characters with category Nd (digit).
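
The four-step hierarchy above could be sketched in C roughly as
follows. This is only an illustration, not musl's implementation. The
names classify, toy_is_assigned and toy_is_alnum are hypothetical, and
the two toy_ predicates know ASCII only; they stand in for tables that
would be generated from the Unicode data files. Treating surrogates
and values above U+10FFFF as invalid, and reusing the glibc-style
control set (C0/C1 controls plus U+2028 and U+2029), are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Class bits; a codepoint may carry several (e.g. print + graph + alnum). */
enum { C_CNTRL = 1, C_PRINT = 2, C_GRAPH = 4, C_ALNUM = 8, C_PUNCT = 16 };

/* Assumption: "valid" means a Unicode scalar value (no surrogates). */
static bool is_valid(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Assumption: control = C0/C1 controls and DEL, plus Zl/Zp (U+2028, U+2029). */
static bool is_control(uint32_t cp)
{
    return cp < 0x20 || (cp >= 0x7F && cp <= 0x9F)
        || cp == 0x2028 || cp == 0x2029;
}

/* Hypothetical placeholder: a real version would consult a table of
   assigned codepoints. Here only ASCII counts as assigned. */
static bool toy_is_assigned(uint32_t cp)
{
    return cp < 0x80;
}

/* Hypothetical placeholder for "Alphabetic property or category Nd";
   covers ASCII letters and digits only. */
static bool toy_is_alnum(uint32_t cp)
{
    return (cp >= '0' && cp <= '9')
        || (cp >= 'A' && cp <= 'Z')
        || (cp >= 'a' && cp <= 'z');
}

/* Returns a bitmask of class bits, or 0 for an invalid codepoint. */
unsigned classify(uint32_t cp)
{
    if (!is_valid(cp)) return 0;
    if (is_control(cp)) return C_CNTRL;         /* 1: control or printable   */
    unsigned c = C_PRINT;
    if (cp == 0x20) return c;                   /* 2: ASCII space or graphic */
    c |= C_GRAPH;
    if (!toy_is_assigned(cp)) return c;         /* 3: unassigned stays graph */
    c |= toy_is_alnum(cp) ? C_ALNUM : C_PUNCT;  /* 4: alnum or punctuation   */
    return c;
}
```

The point of step 3 is visible in the early return: a codepoint the
tables do not know about still comes back printable and graphic, it
just gets neither the alphanumeric nor the punctuation bit.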

And possibly also, as alphanumeric:

- Some or all characters with category No (other numeric - this
  includes things like superscripts, vulgar fractions, and
  script-specific numerical notations that are anything other than a
  direct copy of the ten decimal digits). Note that most of these in
  the Latin blocks are traditionally considered punctuation on Unixy
  systems.

- Some or all characters with category So (other symbol) with LETTER
  in their names (just U+2129 turned Greek small letter iota and a
  bunch of useless circled/parenthesized letters).

- Excluding a few special cases like glibc does (2 Thai characters
  that are actually not letters but punctuation, according to
  Theppitak Karoonboonyanan).

Does this sound like a reasonable plan? Any tweaks needed?

Rich