* [PATCH] towupper/towlower: Update to Unicode 9.0 @ 2017-09-13 8:25 Reini Urban 2017-09-13 10:05 ` Reini Urban 0 siblings, 1 reply; 5+ messages in thread From: Reini Urban @ 2017-09-13 8:25 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 88 bytes --] taken from my safeclib (MIT licensed) and cross-checked with the perl unicode tables [-- Attachment #2: 0001-towupper-towlower-Update-to-Unicode-9.0.patch --] [-- Type: application/octet-stream, Size: 2358 bytes --] From c810e57fa5935c2802eb133e0495cfe5f7087195 Mon Sep 17 00:00:00 2001 From: Reini Urban <rurban@cpan.org> Date: Wed, 13 Sep 2017 10:09:03 +0200 Subject: [PATCH] towupper/towlower: Update to Unicode 9.0 taken from safeclib and cross-checked with the perl unicode tables --- src/ctype/towctrans.c | 37 +++++++++++++++++++++++++++++++++++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git src/ctype/towctrans.c src/ctype/towctrans.c index cf13a86..59beddd 100644 --- src/ctype/towctrans.c +++ src/ctype/towctrans.c @@ -82,10 +82,26 @@ static const struct { CASELACE(0xa790,0xa792), CASELACE(0xa7a0,0xa7a8), + CASELACE(0xa7b4,0xa7b6), /* Unicode 8 */ + CASEMAP(0xff21,0xff3a,0xff41), { 0,0,0 } }; +static const struct { + unsigned int upper; + signed char lower; + unsigned char len; +} casemapsl[] = { + CASEMAP(0x10400,0x10427,0x10428), + + CASEMAP(0x104b0,0x104d3,0x104d8), /* Unicode 9 */ + CASEMAP(0x10c80,0x10cb2,0x10cc0), /* Unicode 8 */ + CASEMAP(0x118a0,0x118bf,0x118c0), /* Unicode 7 */ + CASEMAP(0x1e900,0x1e921,0x1e922), /* Unicode 9 */ + { 0,0,0 } +}; + static const unsigned short pairs[][2] = { { 'I', 0x0131 }, { 'S', 0x017f }, @@ -201,6 +217,17 @@ static const unsigned short pairs[][2] = { { 0xa78d, 0x265 }, { 0xa7aa, 0x266 }, + { 0xa7ab, 0x25c }, /* Unicode 7.0 */ + { 0xa7ac, 0x261 }, /* Unicode 7.0 */ + { 0xa7ad, 0x26c }, /* Unicode 7.0 */ + { 0xa7ae, 0x26a }, /* Unicode 9.0 */ + { 0xa7b0, 0x29e }, /* Unicode 7.0 */ + { 0xa7b1, 0x287 }, /* Unicode 7.0 */ + { 0xa7b2, 0x29d }, /* Unicode 7.0 */ + { 0xa7b3, 0xab53 }, /* Unicode 8.0 */ + { 0xa7b4, 0xa7b5 }, /* Unicode 8.0 */ + { 0xa7b6, 0xa7b7 }, /* Unicode 8.0 */ + { 0x10c7, 0x2d27 }, { 0x10cd, 0x2d2d }, @@ -250,8 +277,14 @@ static wchar_t __towcase(wchar_t wc, int lower) if (pairs[i][1-lower] == wc) return pairs[i][lower]; } - if ((unsigned)wc - (0x10428 - 0x28*lower) < 0x28) - return wc - 0x28 + 0x50*lower; + for (i=0; casemapsl[i].len; i++) { + int base = casemapsl[i].upper + (lmask & casemapsl[i].lower); + if ((unsigned)wc-base < casemapsl[i].len) { + if (casemapsl[i].lower == 1) + return wc + lower - ((wc-casemapsl[i].upper)&1); + return wc + lmul*casemapsl[i].lower; + } + } return wc; } -- 2.8.4 (Apple Git-73) [-- Attachment #3: Type: text/plain, Size: 33 bytes --] Reini Urban rurban@cpan.org ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] towupper/towlower: Update to Unicode 9.0 2017-09-13 8:25 [PATCH] towupper/towlower: Update to Unicode 9.0 Reini Urban @ 2017-09-13 10:05 ` Reini Urban 2017-09-13 18:13 ` Rich Felker 0 siblings, 1 reply; 5+ messages in thread From: Reini Urban @ 2017-09-13 10:05 UTC (permalink / raw) To: musl Wait a bit with that. I think I found some more Unicode 9.0 issues with the tables, and I’ve found a huge performance opportunity by sorting the 3 tables (mostly pairs), and break the loops earlier. This should come close to glibc table performance then, without the huge memory costs they have. I’ll write a perl regression testing script not to miss any more mappings, and maybe improve the current musl logic. This will need 1-2 days. I’ll also use it for cperl then. Reini Urban rurban@cpan.org > On Sep 13, 2017, at 10:25 AM, Reini Urban <rurban@cpan.org> wrote: > > taken from my safeclib (MIT licensed) and cross-checked with the perl unicode tables > > <0001-towupper-towlower-Update-to-Unicode-9.0.patch> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Re: [PATCH] towupper/towlower: Update to Unicode 9.0 2017-09-13 10:05 ` Reini Urban @ 2017-09-13 18:13 ` Rich Felker 2017-10-20 9:00 ` Reini Urban 0 siblings, 1 reply; 5+ messages in thread From: Rich Felker @ 2017-09-13 18:13 UTC (permalink / raw) To: musl On Wed, Sep 13, 2017 at 12:05:19PM +0200, Reini Urban wrote: > Wait a bit with that. I think I found some more Unicode 9.0 issues with the tables, > and I’ve found a huge performance opportunity by sorting the 3 tables (mostly pairs), > and break the loops earlier. > This should come close to glibc table performance then, without the huge memory costs they have. > > I’ll write a perl regression testing script not to miss any more mappings, and maybe > improve the current musl logic. This will need 1-2 days. > I’ll also use it for cperl then. Thanks for the update. I still need to publish the table generation code for all the other tables -- I got it mostly dug up and cleaned up but got interrupted last time so it's still not posted. With that it will be possible to update other things too, not just case mappings. A few of the existing tables are using an older version of the tabulation code that formats the big arrays differently, so I'll probably first make a commit to reformat them, so that it's possible to mechanically check that this commit does not change the generated .o files, then use the uniform formatting as the basis the subsequent update to Unicode 9.0. That should not affect the case mapping file though since it's not machine-generated. Rich ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Re: [PATCH] towupper/towlower: Update to Unicode 9.0 2017-09-13 18:13 ` Rich Felker @ 2017-10-20 9:00 ` Reini Urban 2017-10-25 18:38 ` Rich Felker 0 siblings, 1 reply; 5+ messages in thread From: Reini Urban @ 2017-10-20 9:00 UTC (permalink / raw) To: musl [-- Attachment #1.1: Type: text/plain, Size: 1690 bytes --] On Wed, Sep 13, 2017 at 8:13 PM, Rich Felker wrote: > On Wed, Sep 13, 2017 at 12:05:19PM +0200, Reini Urban wrote: > > Wait a bit with that. I think I found some more Unicode 9.0 issues with > the tables, > > and I’ve found a huge performance opportunity by sorting the 3 tables > (mostly pairs), > > and break the loops earlier. > > This should come close to glibc table performance then, without the huge > memory costs they have. > > > > I’ll write a perl regression testing script not to miss any more > mappings, and maybe > > improve the current musl logic. This will need 1-2 days. > > I’ll also use it for cperl then. > > Thanks for the update. I still need to publish the table generation > code for all the other tables -- I got it mostly dug up and cleaned up > but got interrupted last time so it's still not posted. With that it > will be possible to update other things too, not just case mappings. > > A few of the existing tables are using an older version of the > tabulation code that formats the big arrays differently, so I'll > probably first make a commit to reformat them, so that it's possible > to mechanically check that this commit does not change the generated > .o files, then use the uniform formatting as the basis the subsequent > update to Unicode 9.0. That should not affect the case mapping file > though since it's not machine-generated. > I haven't yet seen your table generator, so I updated the tables with my version, as I use them in safeclib. Unicode 10.0 support plus sort tables for double search speed. I also added a harmless patch to a check-syntax target for emacs flymake support. -- Reini [-- Attachment #1.2: Type: text/html, Size: 2139 bytes --] [-- Attachment #2: 0001-towupper-towlower-Update-to-Unicode-10.0-and-sort.patch --] [-- Type: application/octet-stream, Size: 9420 bytes --] From bd9f1e60ac55143c507c767ba070ab99a5760baa Mon Sep 17 00:00:00 2001 From: Reini Urban <rurban@cpan.org> Date: Wed, 13 Sep 2017 10:09:03 +0200 Subject: [PATCH 1/2] towupper/towlower: Update to Unicode 10.0 and sort taken from safeclib and cross-checked with the perl unicode tables. sort the tables and exit when found. O(n) -> O(n/2) --- src/ctype/towctrans.c | 213 ++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 163 insertions(+), 50 deletions(-) diff --git src/ctype/towctrans.c src/ctype/towctrans.c index cf13a86..4745487 100644 --- src/ctype/towctrans.c +++ src/ctype/towctrans.c @@ -1,16 +1,21 @@ #include <ctype.h> #include <wctype.h> +#include <assert.h> #include "libc.h" #define CASEMAP(u1,u2,l) { (u1), (l)-(u1), (u2)-(u1)+1 } #define CASELACE(u1,u2) CASEMAP((u1),(u2),(u1)+1) +/* Unicode 10.0 */ + +/* must be sorted */ static const struct { unsigned short upper; signed char lower; unsigned char len; } casemaps[] = { - CASEMAP(0xc0,0xde,0xe0), + CASEMAP (0x00c0,0xd6,0xe0), + CASEMAP (0x00d8,0xde,0xf8), CASELACE(0x0100,0x012e), CASELACE(0x0132,0x0136), @@ -18,11 +23,21 @@ static const struct { CASELACE(0x014a,0x0176), CASELACE(0x0179,0x017d), - CASELACE(0x370,0x372), - CASEMAP(0x391,0x3a1,0x3b1), - CASEMAP(0x3a3,0x3ab,0x3c3), - CASEMAP(0x400,0x40f,0x450), - CASEMAP(0x410,0x42f,0x430), + CASELACE(0x01a0,0x1a4), + CASELACE(0x01b3,0x1b5), + CASELACE(0x01cd,0x1db), + CASELACE(0x01de,0x1ee), + CASELACE(0x01f8,0x21e), + CASELACE(0x0222,0x232), + CASELACE(0x0246,0x24e), + + CASELACE(0x0370,0x372), + CASEMAP (0x0388,0x38a,0x3ad), + CASEMAP (0x0393,0x39f,0x3b3), + CASEMAP (0x03a7,0x3ab,0x3c7), + CASELACE(0x03d8,0x3ee), + CASEMAP (0x0400,0x40f,0x450), + CASEMAP (0x0410,0x42f,0x430), CASELACE(0x460,0x480), CASELACE(0x48a,0x4be), @@ -80,17 +95,40 @@ static const struct { CASELACE(0xa77e,0xa786), CASELACE(0xa790,0xa792), + CASELACE(0xa796,0xa79e), CASELACE(0xa7a0,0xa7a8), + CASELACE(0xa7b4,0xa7b6), /* Unicode 8 */ + CASEMAP(0xff21,0xff3a,0xff41), { 0,0,0 } }; +/* must be sorted */ +static const struct { + unsigned int upper; + int lower; + unsigned short len; +} casemapsl[] = { + CASEMAP(0x13a0,0x13ef,0xab70), /* CHEROKEE reverse */ + CASEMAP(0xab70,0xabbf,0x13a0), /* CHEROKEE */ + CASEMAP(0x10400,0x10427,0x10428), + CASEMAP(0x104b0,0x104d3,0x104d8), /* Unicode 9 */ + CASEMAP(0x10c80,0x10cb2,0x10cc0), /* Unicode 8 */ + CASEMAP(0x118a0,0x118bf,0x118c0), /* Unicode 7 */ + CASEMAP(0x1e900,0x1e921,0x1e922), /* Unicode 9 */ + { 0,0,0 } +}; + +/* must now be sorted */ static const unsigned short pairs[][2] = { + /* upper - lower */ { 'I', 0x0131 }, { 'S', 0x017f }, + { 0x00b5, 0x03bc }, { 0x0130, 'i' }, { 0x0178, 0x00ff }, + { 0x017f, 0x73 }, { 0x0181, 0x0253 }, { 0x0182, 0x0183 }, { 0x0184, 0x0185 }, @@ -111,6 +149,7 @@ static const unsigned short pairs[][2] = { { 0x019c, 0x026f }, { 0x019d, 0x0272 }, { 0x019f, 0x0275 }, + /*CASELACE(0x01a0,0x01a4),*/ { 0x01a6, 0x0280 }, { 0x01a7, 0x01a8 }, { 0x01a9, 0x0283 }, @@ -119,38 +158,108 @@ static const unsigned short pairs[][2] = { { 0x01af, 0x01b0 }, { 0x01b1, 0x028a }, { 0x01b2, 0x028b }, + { 0x01b3, 0x01b4 }, + { 0x01b5, 0x01b6 }, { 0x01b7, 0x0292 }, { 0x01b8, 0x01b9 }, { 0x01bc, 0x01bd }, { 0x01c4, 0x01c6 }, - { 0x01c4, 0x01c5 }, + /*{ 0x01c4, 0x01c5 },*/ { 0x01c5, 0x01c6 }, { 0x01c7, 0x01c9 }, - { 0x01c7, 0x01c8 }, + /*{ 0x01c7, 0x01c8 },*/ { 0x01c8, 0x01c9 }, { 0x01ca, 0x01cc }, - { 0x01ca, 0x01cb }, + /*{ 0x01ca, 0x01cb },*/ + /*CASELACE(0x01cb,0x01db),*/ { 0x01cb, 0x01cc }, + { 0x01f1, 0x01f3 }, - { 0x01f1, 0x01f2 }, + /*{ 0x01f1, 0x01f2 },*/ { 0x01f2, 0x01f3 }, { 0x01f4, 0x01f5 }, { 0x01f6, 0x0195 }, { 0x01f7, 0x01bf }, + /*CASELACE(0x01f8,0x021e),*/ { 0x0220, 0x019e }, - { 0x0386, 0x03ac }, - { 0x0388, 0x03ad }, - { 0x0389, 0x03ae }, - { 0x038a, 0x03af }, + /*CASELACE(0x0222,0x0232),*/ + { 0x023a, 0x2c65 }, + { 0x023b, 0x23c }, + { 0x023d, 0x19a }, + { 0x023e, 0x2c66 }, + { 0x0241, 0x242 }, + { 0x0243, 0x180 }, + { 0x0244, 0x289 }, + { 0x0245, 0x28c }, + + { 0x0345, 0x3b9 }, + { 0x0376, 0x377 }, /* bogus greek 'symbol' */ + { 0x037f, 0x3f3 }, + { 0x0386, 0x3ac }, { 0x038c, 0x03cc }, { 0x038e, 0x03cd }, { 0x038f, 0x03ce }, - { 0x0399, 0x0345 }, - { 0x0399, 0x1fbe }, - { 0x03a3, 0x03c2 }, + { 0x0391, 0x3b1 }, + { 0x0392, 0x3b2 }, + { 0x0392, 0x3d0 }, /* reverse */ + /*CASEMAP (0x0393,0x39f,0x3b3),*/ + { 0x0395, 0x3f5 }, /* reverse */ + { 0x0398, 0x3d1 }, + { 0x0399, 0x1fbe },/* reverse */ + { 0x039a, 0x3f0 }, /* reverse */ + { 0x03a0, 0x3c0 }, + { 0x03a0, 0x3d6 }, /* reverse */ + { 0x03a1, 0x3c1 }, + { 0x03a1, 0x3f1 }, /* reverse */ + { 0x03a3, 0x3c3 }, + { 0x03a3, 0x3c2 }, /* reverse */ + { 0x03a4, 0x3c4 }, + { 0x03a5, 0x3c5 }, + { 0x03a6, 0x3c6 }, + { 0x03a6, 0x3d5 }, /* reverse */ + /*CASEMAP(0x0391,0x3a1,0x3b1),*/ + { 0x03c2, 0x3c3 }, + { 0x03cf, 0x3d7 }, + { 0x03d0, 0x3b2 }, + { 0x03d1, 0x3b8 }, + { 0x03d5, 0x3c6 }, + { 0x03d6, 0x3c0 }, + /*CASELACE(0x03d8,0x3ee),*/ + /*CASEMAP(0x03da,0x3ee,0x3db),*/ + { 0x03f0, 0x03ba }, + { 0x03f1, 0x03c1 }, + { 0x03f4, 0x03b8 }, + { 0x03f5, 0x03b5 }, { 0x03f7, 0x03f8 }, + { 0x03f9, 0x03f2 }, { 0x03fa, 0x03fb }, + { 0x03fd, 0x037b }, + { 0x03fe, 0x037c }, + { 0x03ff, 0x037d }, + /*CASEMAP(0x0400,0x40f,0x450), + CASEMAP(0x0410,0x42f,0x430),*/ + { 0x412, 0x1c80 }, /* reverse */ + { 0x414, 0x1c81 }, /* reverse */ + { 0x41e, 0x1c82 }, /* reverse */ + { 0x421, 0x1c83 }, /* reverse */ + { 0x422, 0x1c84 }, /* reverse */ + { 0x422, 0x1c85 }, /* reverse */ + { 0x42a, 0x1c86 }, /* reverse */ + { 0x462, 0x463 }, + { 0x462, 0x1c87 }, /* reverse */ + + { 0x04c0, 0x04cf}, + /*CASELACE(0x04c1,0x4cd),*/ + { 0x0528, 0x0529}, + { 0x052a, 0x052b}, + { 0x052c, 0x052d}, + { 0x052e, 0x052f}, + + { 0x10c7, 0x2d27 }, + { 0x10cd, 0x2d2d }, + { 0x1e60, 0x1e9b }, + { 0x1e9b, 0x1e61 }, { 0x1e9e, 0xdf }, { 0x1f59, 0x1f51 }, @@ -158,25 +267,11 @@ static const unsigned short pairs[][2] = { { 0x1f5d, 0x1f55 }, { 0x1f5f, 0x1f57 }, { 0x1fbc, 0x1fb3 }, + { 0x1fbe, 0x3b9 }, { 0x1fcc, 0x1fc3 }, { 0x1fec, 0x1fe5 }, { 0x1ffc, 0x1ff3 }, - { 0x23a, 0x2c65 }, - { 0x23b, 0x23c }, - { 0x23d, 0x19a }, - { 0x23e, 0x2c66 }, - { 0x241, 0x242 }, - { 0x243, 0x180 }, - { 0x244, 0x289 }, - { 0x245, 0x28c }, - { 0x3f4, 0x3b8 }, - { 0x3f9, 0x3f2 }, - { 0x3fd, 0x37b }, - { 0x3fe, 0x37c }, - { 0x3ff, 0x37d }, - { 0x4c0, 0x4cf }, - { 0x2126, 0x3c9 }, { 0x212a, 'k' }, { 0x212b, 0xe5 }, @@ -196,25 +291,25 @@ static const unsigned short pairs[][2] = { { 0x2c7f, 0x240 }, { 0x2cf2, 0x2cf3 }, + { 0xa64a, 0xa64b }, + { 0xa64a, 0x1c88 }, /* reverse */ + { 0xa77d, 0x1d79 }, { 0xa78b, 0xa78c }, { 0xa78d, 0x265 }, { 0xa7aa, 0x266 }, - { 0x10c7, 0x2d27 }, - { 0x10cd, 0x2d2d }, + { 0xa7ab, 0x25c }, /* Unicode 7.0 */ + { 0xa7ac, 0x261 }, /* Unicode 7.0 */ + { 0xa7ad, 0x26c }, /* Unicode 7.0 */ + { 0xa7ae, 0x26a }, /* Unicode 9.0 */ + { 0xa7b0, 0x29e }, /* Unicode 7.0 */ + { 0xa7b1, 0x287 }, /* Unicode 7.0 */ + { 0xa7b2, 0x29d }, /* Unicode 7.0 */ + { 0xa7b3, 0xab53 }, /* Unicode 8.0 */ + { 0xa7b4, 0xa7b5 }, /* Unicode 8.0 */ - /* bogus greek 'symbol' letters */ - { 0x376, 0x377 }, - { 0x39c, 0xb5 }, - { 0x392, 0x3d0 }, - { 0x398, 0x3d1 }, - { 0x3a6, 0x3d5 }, - { 0x3a0, 0x3d6 }, - { 0x39a, 0x3f0 }, - { 0x3a1, 0x3f1 }, - { 0x395, 0x3f5 }, - { 0x3cf, 0x3d7 }, + { 0xa7b6, 0xa7b7 }, /* Unicode 8.0 */ { 0,0 } }; @@ -229,29 +324,47 @@ static wchar_t __towcase(wchar_t wc, int lower) if (!iswalpha(wc) || (unsigned)wc - 0x0600 <= 0x0fff-0x0600 || (unsigned)wc - 0x2e00 <= 0xa63f-0x2e00 - || (unsigned)wc - 0xa800 <= 0xfeff-0xa800) + || (unsigned)wc - 0xa800 <= 0xab69-0xa800 + || (unsigned)wc - 0xabc0 <= 0xfeff-0xabc0) return wc; /* special case because the diff between upper/lower is too big */ - if (lower && (unsigned)wc - 0x10a0 < 0x2e) + if (lower && (unsigned)wc - 0x10a0 < 0x2e) { if (wc>0x10c5 && wc != 0x10c7 && wc != 0x10cd) return wc; else return wc + 0x2d00 - 0x10a0; - if (!lower && (unsigned)wc - 0x2d00 < 0x26) + } + if (!lower && (unsigned)wc - 0x2d00 < 0x26) { if (wc>0x2d25 && wc != 0x2d27 && wc != 0x2d2d) return wc; else return wc + 0x10a0 - 0x2d00; + } for (i=0; casemaps[i].len; i++) { int base = casemaps[i].upper + (lmask & casemaps[i].lower); + assert(i>0 ? casemaps[i].upper >= casemaps[i-1].upper : 1); if ((unsigned)wc-base < casemaps[i].len) { if (casemaps[i].lower == 1) return wc + lower - ((wc-casemaps[i].upper)&1); return wc + lmul*casemaps[i].lower; } + if (lower && casemaps[i].upper > wc) + break; } for (i=0; pairs[i][1-lower]; i++) { + assert(i>0 ? pairs[i][0] >= pairs[i-1][0] : 1); if (pairs[i][1-lower] == wc) return pairs[i][lower]; + if (lower && pairs[i][0] > wc) + break; + } + for (i=0; casemapsl[i].len; i++) { + unsigned long base = casemapsl[i].upper + (lmask & casemapsl[i].lower); + assert(i>0 ? casemapsl[i].upper >= casemapsl[i-1].upper : 1); + if ((unsigned)wc-base < casemapsl[i].len) { + if (casemapsl[i].lower == 1) + return wc + lower - ((wc-casemapsl[i].upper)&1); + return wc + lmul*casemapsl[i].lower; + } + if (lower && casemaps[i].upper > wc) + break; } - if ((unsigned)wc - (0x10428 - 0x28*lower) < 0x28) - return wc - 0x28 + 0x50*lower; return wc; } -- 2.8.4 (Apple Git-73) [-- Attachment #3: 0002-add-emacs-flymake-support.patch --] [-- Type: application/octet-stream, Size: 1031 bytes --] From 347be94765fe4993e143ed33ae874c642446f3ca Mon Sep 17 00:00:00 2001 From: Reini Urban <rurban@cpan.org> Date: Fri, 20 Oct 2017 10:46:44 +0200 Subject: [PATCH 2/2] add emacs flymake support --- Makefile | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git Makefile Makefile index d2e8997..9eb0cd6 100644 --- Makefile +++ Makefile @@ -193,6 +193,11 @@ obj/%-clang: $(srcdir)/tools/%-clang.in config.mak sed -e 's!@CC@!$(WRAPCC_CLANG)!g' -e 's!@PREFIX@!$(prefix)!g' -e 's!@INCDIR@!$(includedir)!g' -e 's!@LIBDIR@!$(libdir)!g' -e 's!@LDSO@!$(LDSO_PATHNAME)!g' $< > $@ chmod +x $@ +# emacs flymake-mode +check-syntax: + test -n "$(CHK_SOURCES)" && \ + $(CC) $(CFLAGS_ALL) -o /dev/null -S $(CHK_SOURCES) + $(DESTDIR)$(bindir)/%: obj/% $(INSTALL) -D $< $@ @@ -239,4 +244,4 @@ clean: distclean: clean rm -f config.mak -.PHONY: all clean install install-libs install-headers install-tools +.PHONY: all clean install install-libs install-headers install-tools check-syntax -- 2.8.4 (Apple Git-73) ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Re: [PATCH] towupper/towlower: Update to Unicode 9.0 2017-10-20 9:00 ` Reini Urban @ 2017-10-25 18:38 ` Rich Felker 0 siblings, 0 replies; 5+ messages in thread From: Rich Felker @ 2017-10-25 18:38 UTC (permalink / raw) To: musl On Fri, Oct 20, 2017 at 11:00:04AM +0200, Reini Urban wrote: > On Wed, Sep 13, 2017 at 8:13 PM, Rich Felker wrote: > > > On Wed, Sep 13, 2017 at 12:05:19PM +0200, Reini Urban wrote: > > > Wait a bit with that. I think I found some more Unicode 9.0 issues with > > the tables, > > > and I’ve found a huge performance opportunity by sorting the 3 tables > > (mostly pairs), > > > and break the loops earlier. > > > This should come close to glibc table performance then, without the huge > > memory costs they have. > > > > > > I’ll write a perl regression testing script not to miss any more > > mappings, and maybe > > > improve the current musl logic. This will need 1-2 days. > > > I’ll also use it for cperl then. > > > > Thanks for the update. I still need to publish the table generation > > code for all the other tables -- I got it mostly dug up and cleaned up > > but got interrupted last time so it's still not posted. With that it > > will be possible to update other things too, not just case mappings. > > > > A few of the existing tables are using an older version of the > > tabulation code that formats the big arrays differently, so I'll > > probably first make a commit to reformat them, so that it's possible > > to mechanically check that this commit does not change the generated > > .o files, then use the uniform formatting as the basis the subsequent > > update to Unicode 9.0. That should not affect the case mapping file > > though since it's not machine-generated. > > > > > I haven't yet seen your table generator, so I updated the tables with my > version, as I > use them in safeclib. > Unicode 10.0 support plus sort tables for double search speed. This patch contains multiple independent changes, some of them possibly incorrect, that makes it hard to review, but I'll try: > taken from safeclib and cross-checked with the perl unicode tables. > sort the tables and exit when found. O(n) -> O(n/2) I think here you mean average runtime, not big-O, but it's not clear that the average is improved except when characters are evenly distributed. I'm not sure if the existing table was sorted to put common/important characters first, but doing so would probably help real-world performance more than sorting by codepoint. If sorting by codepoint really does turn out to be a smart thing to do, it needs to be done as a separate change, so that anyone reading the file history can see the functional changes (addition of new mappings) separately from optimizations. Otherwise history of changes to mappings is buried. > --- > src/ctype/towctrans.c | 213 ++++++++++++++++++++++++++++++++++++++------------ > 1 file changed, 163 insertions(+), 50 deletions(-) > > diff --git src/ctype/towctrans.c src/ctype/towctrans.c > index cf13a86..4745487 100644 > --- src/ctype/towctrans.c > +++ src/ctype/towctrans.c > @@ -1,16 +1,21 @@ > #include <ctype.h> > #include <wctype.h> > +#include <assert.h> > #include "libc.h" > > #define CASEMAP(u1,u2,l) { (u1), (l)-(u1), (u2)-(u1)+1 } > #define CASELACE(u1,u2) CASEMAP((u1),(u2),(u1)+1) > > +/* Unicode 10.0 */ > + > +/* must be sorted */ > static const struct { > unsigned short upper; > signed char lower; > unsigned char len; > } casemaps[] = { > - CASEMAP(0xc0,0xde,0xe0), > + CASEMAP (0x00c0,0xd6,0xe0), > + CASEMAP (0x00d8,0xde,0xf8), This looks like a silent bugfix for incorrect mapping of ×/÷ as a case pair. But since iswalpha short-circuits the whole function, I think there's actually no change in behavior and this is just a pessimization, splitting 1 range to 2. Said differently, it's okay to have spurious junk in the case mapping table as long as the mappings are correct for alphabetic-class characters, and there might be other cases where that allowance helps. Also FWIW the change in spacing here made it hard for me to notice that there was also a table contents change going on. > CASELACE(0x0100,0x012e), > CASELACE(0x0132,0x0136), > @@ -18,11 +23,21 @@ static const struct { > CASELACE(0x014a,0x0176), > CASELACE(0x0179,0x017d), > > - CASELACE(0x370,0x372), > - CASEMAP(0x391,0x3a1,0x3b1), > - CASEMAP(0x3a3,0x3ab,0x3c3), > - CASEMAP(0x400,0x40f,0x450), > - CASEMAP(0x410,0x42f,0x430), > + CASELACE(0x01a0,0x1a4), > + CASELACE(0x01b3,0x1b5), > + CASELACE(0x01cd,0x1db), > + CASELACE(0x01de,0x1ee), > + CASELACE(0x01f8,0x21e), > + CASELACE(0x0222,0x232), > + CASELACE(0x0246,0x24e), It looks like the motivation here was to have all basic Latin, Greek, and Cyrillic characters covered as close to the top of the table as possible, with obscure characters for each of these alphabets later. If there are other alphabets/languages with case mappings their basic characters should probably also be early in the table or we should have some way to make the mappings more efficient than linear-time. > + > + CASELACE(0x0370,0x372), > + CASEMAP (0x0388,0x38a,0x3ad), > + CASEMAP (0x0393,0x39f,0x3b3), > + CASEMAP (0x03a7,0x3ab,0x3c7), > + CASELACE(0x03d8,0x3ee), > + CASEMAP (0x0400,0x40f,0x450), > + CASEMAP (0x0410,0x42f,0x430), > > CASELACE(0x460,0x480), > CASELACE(0x48a,0x4be), > @@ -80,17 +95,40 @@ static const struct { > CASELACE(0xa77e,0xa786), > > CASELACE(0xa790,0xa792), > + CASELACE(0xa796,0xa79e), > CASELACE(0xa7a0,0xa7a8), > > + CASELACE(0xa7b4,0xa7b6), /* Unicode 8 */ > + > CASEMAP(0xff21,0xff3a,0xff41), > { 0,0,0 } > }; > > +/* must be sorted */ > +static const struct { > + unsigned int upper; > + int lower; > + unsigned short len; > +} casemapsl[] = { Indention used spaces here rather than tabs. > + CASEMAP(0x13a0,0x13ef,0xab70), /* CHEROKEE reverse */ > + CASEMAP(0xab70,0xabbf,0x13a0), /* CHEROKEE */ > + CASEMAP(0x10400,0x10427,0x10428), > + CASEMAP(0x104b0,0x104d3,0x104d8), /* Unicode 9 */ > + CASEMAP(0x10c80,0x10cb2,0x10cc0), /* Unicode 8 */ > + CASEMAP(0x118a0,0x118bf,0x118c0), /* Unicode 7 */ > + CASEMAP(0x1e900,0x1e921,0x1e922), /* Unicode 9 */ > + { 0,0,0 } > +}; > + > +/* must now be sorted */ > static const unsigned short pairs[][2] = { > + /* upper - lower */ > { 'I', 0x0131 }, > { 'S', 0x017f }, > + { 0x00b5, 0x03bc }, If desired, this is a change that requires its own discussion and independent commit, but I think it's probably a bad change. > { 0x0130, 'i' }, > { 0x0178, 0x00ff }, > + { 0x017f, 0x73 }, Likewise -- it's not a new character but change in existing mappings -- but this one is probably okay. > { 0x0181, 0x0253 }, > { 0x0182, 0x0183 }, > { 0x0184, 0x0185 }, > @@ -111,6 +149,7 @@ static const unsigned short pairs[][2] = { > { 0x019c, 0x026f }, > { 0x019d, 0x0272 }, > { 0x019f, 0x0275 }, > + /*CASELACE(0x01a0,0x01a4),*/ > { 0x01a6, 0x0280 }, > { 0x01a7, 0x01a8 }, > { 0x01a9, 0x0283 }, Is there a meaning behind the embedded comment here? > @@ -119,38 +158,108 @@ static const unsigned short pairs[][2] = { > { 0x01af, 0x01b0 }, > { 0x01b1, 0x028a }, > { 0x01b2, 0x028b }, > + { 0x01b3, 0x01b4 }, > + { 0x01b5, 0x01b6 }, These seem to overlap with CASELACE(0x01b3,0x1b5) above. > { 0x01b7, 0x0292 }, > { 0x01b8, 0x01b9 }, > { 0x01bc, 0x01bd }, > { 0x01c4, 0x01c6 }, > - { 0x01c4, 0x01c5 }, > + /*{ 0x01c4, 0x01c5 },*/ > { 0x01c5, 0x01c6 }, This looks incorrect -- it eliminates the mapping from titlecase to uppercase. > { 0x01c7, 0x01c9 }, > - { 0x01c7, 0x01c8 }, > + /*{ 0x01c7, 0x01c8 },*/ > { 0x01c8, 0x01c9 }, Likewise. > { 0x01ca, 0x01cc }, > - { 0x01ca, 0x01cb }, > + /*{ 0x01ca, 0x01cb },*/ > + /*CASELACE(0x01cb,0x01db),*/ > { 0x01cb, 0x01cc }, Likewise. > + > { 0x01f1, 0x01f3 }, > - { 0x01f1, 0x01f2 }, > + /*{ 0x01f1, 0x01f2 },*/ > { 0x01f2, 0x01f3 }, Likewise. > { 0x01f4, 0x01f5 }, > { 0x01f6, 0x0195 }, > { 0x01f7, 0x01bf }, > + /*CASELACE(0x01f8,0x021e),*/ > { 0x0220, 0x019e }, > - { 0x0386, 0x03ac }, > - { 0x0388, 0x03ad }, > - { 0x0389, 0x03ae }, > - { 0x038a, 0x03af }, > + /*CASELACE(0x0222,0x0232),*/ > + { 0x023a, 0x2c65 }, > + { 0x023b, 0x23c }, > + { 0x023d, 0x19a }, > + { 0x023e, 0x2c66 }, > + { 0x0241, 0x242 }, > + { 0x0243, 0x180 }, > + { 0x0244, 0x289 }, > + { 0x0245, 0x28c }, > + > + { 0x0345, 0x3b9 }, > + { 0x0376, 0x377 }, /* bogus greek 'symbol' */ > + { 0x037f, 0x3f3 }, > + { 0x0386, 0x3ac }, > { 0x038c, 0x03cc }, > { 0x038e, 0x03cd }, > { 0x038f, 0x03ce }, > - { 0x0399, 0x0345 }, > - { 0x0399, 0x1fbe }, > - { 0x03a3, 0x03c2 }, > + { 0x0391, 0x3b1 }, > + { 0x0392, 0x3b2 }, > + { 0x0392, 0x3d0 }, /* reverse */ By reverse you mean the same thing as the titlecase mappings removed above (that the entry is only used in the towupper direction)? > + /*CASEMAP (0x0393,0x39f,0x3b3),*/ > + { 0x0395, 0x3f5 }, /* reverse */ > + { 0x0398, 0x3d1 }, > + { 0x0399, 0x1fbe },/* reverse */ > + { 0x039a, 0x3f0 }, /* reverse */ > + { 0x03a0, 0x3c0 }, > + { 0x03a0, 0x3d6 }, /* reverse */ > + { 0x03a1, 0x3c1 }, > + { 0x03a1, 0x3f1 }, /* reverse */ > + { 0x03a3, 0x3c3 }, > + { 0x03a3, 0x3c2 }, /* reverse */ > + { 0x03a4, 0x3c4 }, > + { 0x03a5, 0x3c5 }, > + { 0x03a6, 0x3c6 }, > + { 0x03a6, 0x3d5 }, /* reverse */ > + /*CASEMAP(0x0391,0x3a1,0x3b1),*/ > + { 0x03c2, 0x3c3 }, > + { 0x03cf, 0x3d7 }, > + { 0x03d0, 0x3b2 }, > + { 0x03d1, 0x3b8 }, > + { 0x03d5, 0x3c6 }, > + { 0x03d6, 0x3c0 }, > + /*CASELACE(0x03d8,0x3ee),*/ > + /*CASEMAP(0x03da,0x3ee,0x3db),*/ > + { 0x03f0, 0x03ba }, > + { 0x03f1, 0x03c1 }, > + { 0x03f4, 0x03b8 }, > + { 0x03f5, 0x03b5 }, > { 0x03f7, 0x03f8 }, > + { 0x03f9, 0x03f2 }, > { 0x03fa, 0x03fb }, > + { 0x03fd, 0x037b }, > + { 0x03fe, 0x037c }, > + { 0x03ff, 0x037d }, > + /*CASEMAP(0x0400,0x40f,0x450), > + CASEMAP(0x0410,0x42f,0x430),*/ > + { 0x412, 0x1c80 }, /* reverse */ > + { 0x414, 0x1c81 }, /* reverse */ > + { 0x41e, 0x1c82 }, /* reverse */ > + { 0x421, 0x1c83 }, /* reverse */ > + { 0x422, 0x1c84 }, /* reverse */ > + { 0x422, 0x1c85 }, /* reverse */ > + { 0x42a, 0x1c86 }, /* reverse */ > + { 0x462, 0x463 }, > + { 0x462, 0x1c87 }, /* reverse */ > + > + { 0x04c0, 0x04cf}, > + /*CASELACE(0x04c1,0x4cd),*/ > + { 0x0528, 0x0529}, > + { 0x052a, 0x052b}, > + { 0x052c, 0x052d}, > + { 0x052e, 0x052f}, > + > + { 0x10c7, 0x2d27 }, > + { 0x10cd, 0x2d2d }, > + > { 0x1e60, 0x1e9b }, > + { 0x1e9b, 0x1e61 }, > { 0x1e9e, 0xdf }, > > { 0x1f59, 0x1f51 }, > @@ -158,25 +267,11 @@ static const unsigned short pairs[][2] = { > { 0x1f5d, 0x1f55 }, > { 0x1f5f, 0x1f57 }, > { 0x1fbc, 0x1fb3 }, > + { 0x1fbe, 0x3b9 }, This looks incorrect and probably needs an explanation of motivation as its own patch if it's actually correct. > { 0x1fcc, 0x1fc3 }, > { 0x1fec, 0x1fe5 }, > { 0x1ffc, 0x1ff3 }, > > - { 0x23a, 0x2c65 }, > - { 0x23b, 0x23c }, > - { 0x23d, 0x19a }, > - { 0x23e, 0x2c66 }, > - { 0x241, 0x242 }, > - { 0x243, 0x180 }, > - { 0x244, 0x289 }, > - { 0x245, 0x28c }, > - { 0x3f4, 0x3b8 }, > - { 0x3f9, 0x3f2 }, > - { 0x3fd, 0x37b }, > - { 0x3fe, 0x37c }, > - { 0x3ff, 0x37d }, > - { 0x4c0, 0x4cf }, > - > { 0x2126, 0x3c9 }, > { 0x212a, 'k' }, > { 0x212b, 0xe5 }, > @@ -196,25 +291,25 @@ static const unsigned short pairs[][2] = { > { 0x2c7f, 0x240 }, > { 0x2cf2, 0x2cf3 }, > > + { 0xa64a, 0xa64b }, > + { 0xa64a, 0x1c88 }, /* reverse */ > + > { 0xa77d, 0x1d79 }, > { 0xa78b, 0xa78c }, > { 0xa78d, 0x265 }, > { 0xa7aa, 0x266 }, > > - { 0x10c7, 0x2d27 }, > - { 0x10cd, 0x2d2d }, > + { 0xa7ab, 0x25c }, /* Unicode 7.0 */ > + { 0xa7ac, 0x261 }, /* Unicode 7.0 */ > + { 0xa7ad, 0x26c }, /* Unicode 7.0 */ > + { 0xa7ae, 0x26a }, /* Unicode 9.0 */ > + { 0xa7b0, 0x29e }, /* Unicode 7.0 */ > + { 0xa7b1, 0x287 }, /* Unicode 7.0 */ > + { 0xa7b2, 0x29d }, /* Unicode 7.0 */ > + { 0xa7b3, 0xab53 }, /* Unicode 8.0 */ > + { 0xa7b4, 0xa7b5 }, /* Unicode 8.0 */ > > - /* bogus greek 'symbol' letters */ > - { 0x376, 0x377 }, > - { 0x39c, 0xb5 }, > - { 0x392, 0x3d0 }, > - { 0x398, 0x3d1 }, > - { 0x3a6, 0x3d5 }, > - { 0x3a0, 0x3d6 }, > - { 0x39a, 0x3f0 }, > - { 0x3a1, 0x3f1 }, > - { 0x395, 0x3f5 }, > - { 0x3cf, 0x3d7 }, > + { 0xa7b6, 0xa7b7 }, /* Unicode 8.0 */ > > { 0,0 } > }; > @@ -229,29 +324,47 @@ static wchar_t __towcase(wchar_t wc, int lower) > if (!iswalpha(wc) > || (unsigned)wc - 0x0600 <= 0x0fff-0x0600 > || (unsigned)wc - 0x2e00 <= 0xa63f-0x2e00 > - || (unsigned)wc - 0xa800 <= 0xfeff-0xa800) > + || (unsigned)wc - 0xa800 <= 0xab69-0xa800 > + || (unsigned)wc - 0xabc0 <= 0xfeff-0xabc0) > return wc; > /* special case because the diff between upper/lower is too big */ > - if (lower && (unsigned)wc - 0x10a0 < 0x2e) > + if (lower && (unsigned)wc - 0x10a0 < 0x2e) { > if (wc>0x10c5 && wc != 0x10c7 && wc != 0x10cd) return wc; > else return wc + 0x2d00 - 0x10a0; > - if (!lower && (unsigned)wc - 0x2d00 < 0x26) > + } > + if (!lower && (unsigned)wc - 0x2d00 < 0x26) { > if (wc>0x2d25 && wc != 0x2d27 && wc != 0x2d2d) return wc; > else return wc + 0x10a0 - 0x2d00; > + } > for (i=0; casemaps[i].len; i++) { > int base = casemaps[i].upper + (lmask & casemaps[i].lower); > + assert(i>0 ? casemaps[i].upper >= casemaps[i-1].upper : 1); As musl isn't built with -DNDEBUG, this will actually cause runtime checks and calls into __asser_fail (and thus stdio) to be emitted into the ctype code. If a runtime check were actually desirable to catch some sort of UB, it should use a_crash(), but here it just seems to be a static assertion getting turned into an expensive runtime one because there's no practical way to test it at build time. My leaning would be to just leave these out; if some sort of checks of the tables are desirable that would be its own patch anyway. > if ((unsigned)wc-base < casemaps[i].len) { > if (casemaps[i].lower == 1) > return wc + lower - ((wc-casemaps[i].upper)&1); > return wc + lmul*casemaps[i].lower; > } > + if (lower && casemaps[i].upper > wc) > + break; > } > for (i=0; pairs[i][1-lower]; i++) { > + assert(i>0 ? pairs[i][0] >= pairs[i-1][0] : 1); > if (pairs[i][1-lower] == wc) > return pairs[i][lower]; > + if (lower && pairs[i][0] > wc) > + break; > + } > + for (i=0; casemapsl[i].len; i++) { > + unsigned long base = casemapsl[i].upper + (lmask & casemapsl[i].lower); > + assert(i>0 ? casemapsl[i].upper >= casemapsl[i-1].upper : 1); > + if ((unsigned)wc-base < casemapsl[i].len) { > + if (casemapsl[i].lower == 1) > + return wc + lower - ((wc-casemapsl[i].upper)&1); > + return wc + lmul*casemapsl[i].lower; > + } > + if (lower && casemaps[i].upper > wc) > + break; > } > - if ((unsigned)wc - (0x10428 - 0x28*lower) < 0x28) > - return wc - 0x28 + 0x50*lower; > return wc; > } Likewise for the rest of the asserts. Rich ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2017-10-25 18:38 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-09-13 8:25 [PATCH] towupper/towlower: Update to Unicode 9.0 Reini Urban 2017-09-13 10:05 ` Reini Urban 2017-09-13 18:13 ` Rich Felker 2017-10-20 9:00 ` Reini Urban 2017-10-25 18:38 ` Rich Felker
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).