From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/751
Path: news.gmane.org!not-for-mail
From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Help establishing wctype character classes
Date: Sun, 22 Apr 2012 23:51:04 -0400
Message-ID: <20120423035104.GL14673@brightrain.aerifal.cx>
References: <20120422204103.GJ14673@brightrain.aerifal.cx> <20120423014414.GK14673@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="L6iaP+gRLNZHKoI4"
X-Trace: dough.gmane.org 1335152907 32394 80.91.229.3 (23 Apr 2012 03:48:27 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Mon, 23 Apr 2012 03:48:27 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-752-gllmg-musl=m.gmane.org@lists.openwall.com Mon Apr 23 05:48:26 2012
Return-path:
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200]) by plane.gmane.org with smtp (Exim 4.69) (envelope-from ) id 1SMAGC-00064a-Nr for gllmg-musl@plane.gmane.org; Mon, 23 Apr 2012 05:48:20 +0200
Original-Received: (qmail 11598 invoked by uid 550); 23 Apr 2012 03:48:19 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
Original-Received: (qmail 11589 invoked from network); 23 Apr 2012 03:48:19 -0000
Content-Disposition: inline
In-Reply-To: <20120423014414.GK14673@brightrain.aerifal.cx>
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.linux.lib.musl.general:751
Archived-At:

--L6iaP+gRLNZHKoI4
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

And here's a quick implementation of the iswalnum set generator. At
this point it just generates a dumb lookup table without doing any
multi-level table optimizations or compression, and dumps the table to
stdout in a visual form.
I need to run some analysis on the table to determine the
best way to get it down to a reasonable size.

Rich

--L6iaP+gRLNZHKoI4
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="gen_ctype_alpha.c"

#include <stdio.h>
#include <stdlib.h>

int main()
{
	/* One byte per code point; nonzero means "counts as alnum" */
	char *set = calloc(0x110000,1);
	char buf[128], dummy;
	unsigned a, b;	/* %x stores through an unsigned int pointer */
	FILE *f;

	if (!set) return 1;

	/* Alphabetic property */
	f = fopen("DerivedCoreProperties.txt", "rb");
	if (!f) { perror("DerivedCoreProperties.txt"); return 1; }
	while (fgets(buf, sizeof buf, f)) {
		/* The trailing %c ensures "Alphabetic" is followed by something */
		if (sscanf(buf, "%x..%x ; Alphabetic%c", &a, &b, &dummy)==3)
			for (; a<=b; a++) set[a]=1;
		else if (sscanf(buf, "%x ; Alphabetic%c", &a, &dummy)==2)
			set[a] = 1;
	}
	fclose(f);

	/* Plus Nd category */
	f = fopen("UnicodeData.txt", "rb");
	if (!f) { perror("UnicodeData.txt"); return 1; }
	while (fgets(buf, sizeof buf, f)) {
		if (sscanf(buf, "%x;%*[^;];Nd%c", &a, &dummy)==2)
			set[a] = 1;
	}
	fclose(f);

	/* Fix misclassified Thai characters */
	set[0xe2f] = set[0xe46] = 0;

	/* Fill in elided CJK ranges */
	for (a=0x3400; a<=0x4db5; a++) set[a]=1;
	for (a=0x4e00; a<=0x9fcc; a++) set[a]=1;
	for (a=0xac00; a<=0xd7a3; a++) set[a]=1;
	for (a=0x20000; a<=0x2a6d6; a++) set[a]=1;
	for (a=0x2a700; a<=0x2b734; a++) set[a]=1;
	for (a=0x2b740; a<=0x2b81d; a++) set[a]=1;

	/* Dump the table, 64 code points per output line */
	for (a=0; a<0x110000; a++) {
		putchar(set[a]?'*':'.');
		if (!((a+1)&63)) putchar('\n');
	}
	return 0;
}

--L6iaP+gRLNZHKoI4--
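
The message leaves the "multi-level table" step open. As a rough illustration only (not what the attachment does, and not necessarily the scheme musl ended up with), one common way to shrink a flat per-code-point table like `set` is a two-level layout: split the code space into fixed-size blocks, store each distinct block bitmap once, and keep a small index mapping each block position to its bitmap. The sketch below assumes 256-code-point blocks; the names `two_level`, `two_level_build` and `two_level_test` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NCODE    0x110000            /* size of the Unicode code space */
#define BLKBITS  8                   /* assumed block size: 256 code points */
#define BLKSIZE  (1 << BLKBITS)
#define NBLOCKS  (NCODE >> BLKBITS)  /* 0x1100 blocks */
#define BLKBYTES (BLKSIZE / 8)       /* one bit per code point */

struct two_level {
	unsigned short index[NBLOCKS];      /* block position -> unique bitmap number */
	unsigned char (*blocks)[BLKBYTES];  /* storage for the unique bitmaps */
	size_t nunique;                     /* how many bitmaps are in use */
};

/* Build the two-level form from a flat table with one byte per code
 * point. Returns 0 on success, -1 if allocation fails. */
static int two_level_build(struct two_level *t, const char *set)
{
	unsigned char cur[BLKBYTES];
	size_t i, j, k;

	assert(t && set);
	/* Worst case: every block is distinct */
	t->blocks = malloc(NBLOCKS * sizeof *t->blocks);
	if (!t->blocks) return -1;
	t->nunique = 0;

	for (i = 0; i < NBLOCKS; i++) {
		/* Pack this block's 256 flags into a 32-byte bitmap */
		memset(cur, 0, sizeof cur);
		for (j = 0; j < BLKSIZE; j++)
			if (set[i * BLKSIZE + j])
				cur[j / 8] |= 1 << (j % 8);

		/* Linear search for an identical bitmap already stored */
		for (k = 0; k < t->nunique; k++)
			if (!memcmp(t->blocks[k], cur, sizeof cur)) break;
		if (k == t->nunique)
			memcpy(t->blocks[t->nunique++], cur, sizeof cur);
		t->index[i] = (unsigned short)k;
	}
	return 0;
}

/* Membership test: nonzero if code point c is in the set */
static int two_level_test(const struct two_level *t, unsigned c)
{
	if (c >= NCODE) return 0;
	return (t->blocks[t->index[c >> BLKBITS]][(c & (BLKSIZE - 1)) / 8]
		>> (c % 8)) & 1;
}
```

After building, `nunique * 32` bytes of bitmaps plus the index give the compressed size for this block size; trying other values of `BLKBITS` is one way to do the kind of analysis mentioned above.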