From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: Efficient case mapping
Date: Wed, 7 Mar 2018 13:25:36 -0500 [thread overview]
Message-ID: <20180307182536.GO1436@brightrain.aerifal.cx> (raw)
In-Reply-To: <20180307034852.GL1436@brightrain.aerifal.cx>
[-- Attachment #1: Type: text/plain, Size: 4201 bytes --]
On Tue, Mar 06, 2018 at 10:48:52PM -0500, Rich Felker wrote:
> On Tue, Mar 06, 2018 at 03:27:41PM -0500, Rich Felker wrote:
> > Further optimizing: We're not using all the data we already have. Only
> > alphabetic characters can have case mappings, and we already have an
> > iswalpha table; the above tables are highly correlated with it.
> >
> > If we short-circuit out the whole operation with iswalpha, then the
> > "000 - no case mappings" value is no longer common. Instead, "zero is
> > one of the common deltas" can be included or not on a per-block basis.
> >
> > We can also add, along with the per-block common deltas, a table of
> > which cases each delta goes with, and assign new bit meanings:
> >
> > 00 - is case[0] for block, map to ~case[0] via delta[0]
> > 01 - is case[1] for block, map to ~case[1] via delta[1]
> > 10 - is case[2] for block, map to ~case[2] via delta[2]
> > 11 - is exception, map via binary search of exceptions table
> >
> > If needed we could add back a third bit here, but I think we can get
> > by without it.
> >
> > Note that, because iswalpha is now short-circuiting non-alphabetic
> > characters, we can just assign an arbitrary table value (best: 00) to
> > all non-alphabetic (including unassigned) characters. This drops most
> > of the "simple blocks" (blocks with just 0 or a single delta) down to
> > trivial (elided) blocks.
> >
> > These last ideas ("further optimizing:" and beyond) were worked out
> > while writing this email, and I don't yet have any analysis/test code
> > to go with them. Will follow up later with results. With some luck,
> > final version will be barely-larger than current code, extensible, and
> > fast. :)
>
> I now have results for this approach which I'm attaching. The output
> is from my block analysis tool. For each block, it displays case
> mapping rules sorted by decreasing frequency, with characters that
> don't matter (because they're non-alphabetic) merged into the most
> frequent rule. The syntax is:
>
> rule d(t) [f]
>
> where t is the character type (n = no case, u = uppercase, l =
> lowercase, x = exceptional/titlecase), d is the delta to the opposite
> case character, and f is the frequency of the rule (again, counting
> merging of dontcare into the most frequent).
>
> While the vast majority of blocks are fine with 0-2 bits of rule data,
> a few do have more than 3 fairly-frequent rules, the worst being
> blocks 04, 05, and 2C. Having an extra bit would save at least 169
> exceptions in those, and thus ~676 bytes. The fixed cost of an extra
> table bit is 512 bytes (for the first level), plus 32 bytes per block
> that contains data, so it's pretty close to break-even, but keeping
> the exception table small does help performance too. My leaning would
> be to add the extra bit, but write a cost function so that it's only
> used where the 4th-7th frequencies for the block total more than 8
> (the number of exceptions to break even at 32 bytes).
>
> With that approach it looks like we have about 220 entries in the
> exception tables, 4-5 blocks with a third bit, 21 blocks with a second
> bit, 24 blocks with a first bit, and 3 table headers. That comes to
> about 880+160+672+768+1536 = 4k.
>
> One thing I haven't accounted for is storing the 3 (or 7) deltas per
> block. The raw deltas are signed 17-bit, but if we assume mappings
> don't cross planes they can be thought of as unsigned 16-bit on the
> low 16 bits of the codepoint. But that's still pretty large -- 3
> 16-bit numbers by 512 blocks is 3k. Fortunately I think we can do a
> lot better. First, have a table of all the deltas that ever appear
> (about 20). Then, for each block, have 0-7 single-byte indices into
> this table, and a single-byte index to the sequence of indices. Now we
> just have about 2*20+512+5*7+16*3+3*1 = 638 bytes.
>
> 4.6k is a good bit more than I was planning on spending here, so maybe
> there are still improvements to be made...
And here is a draft of the code that would use the mapping. The
#ifdef'd external tables are to allow compiling the code without
actually having table data yet; otherwise gcc optimizes out the whole
translation unit. :-)
Comments?
Rich
[-- Attachment #2: casemap.c --]
[-- Type: text/plain, Size: 1803 bytes --]
#include <wchar.h>
#include <wctype.h>
#if 0
static const unsigned char tab0[] = {0};
static const unsigned char tab1[] = {0};
static const unsigned char tab2[] = {0};
static const unsigned char rulebases[512] = {0};
static const int rules[] = {0};
static const unsigned char exceptions[][2] = {0};
#else
extern const unsigned char tab0[];
extern const unsigned char tab1[];
extern const unsigned char tab2[];
extern const unsigned char rulebases[512];
extern const int rules[];
extern const unsigned char exceptions[][2];
#endif
static int casing_delta(unsigned c, int dir)
{
unsigned b, x, y, v, rt, xb, xn;
int r, rd;
if (c >= 0x20000) return 0;
b = c>>8;
c &= 255;
x = c>>3;
y = c&7;
v = (tab0[tab0[b]*32+x]>>y&1)
+ (tab1[tab1[b]*32+x]>>y&1)*2
+ (tab2[tab2[b]*32+x]>>y&1)*4;
/* use the bit vector out of the tables as an index into
* a block-specific set of rules and decode the rule into
* a type and a case-mapping delta. */
r = rules[rulebases[b]+v];
rt = r & 255;
rd = r >> 8;
/* rules 0/1 are simple lower/upper case with a delta.
* apply according to desired mapping direction. */
if (rt < 2) return rd & -(rt^dir);
/* binary search. endpoints of the binary search for
* this block are stored in the rule delta field. */
xn = rd & 0xff;
xb = (unsigned)rd >> 8;
while (xn) {
unsigned try = exceptions[xb+xn/2][0];
if (try == c) {
r = rules[exceptions[xb+xn/2][1]];
rt = r & 255;
rd = r >> 8;
if (rt < 2) return rd & -(rt^dir);
/* Hard-coded for the four exceptional titlecase */
return dir ? -1 : 1;
} else if (try > c) {
xn /= 2;
} else {
xb += xn/2;
xn -= xn/2;
}
}
return 0;
}
wint_t towlower(wint_t wc)
{
return wc + casing_delta(wc, 0);
}
wint_t towupper(wint_t wc)
{
return wc + casing_delta(wc, 1);
}
next prev parent reply other threads:[~2018-03-07 18:25 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-03-06 20:27 Rich Felker
2018-03-07 3:48 ` Rich Felker
2018-03-07 18:25 ` Rich Felker [this message]
2018-04-05 17:09 ` Rich Felker
2018-03-07 15:44 ` Rich Felker
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180307182536.GO1436@brightrain.aerifal.cx \
--to=dalias@libc.org \
--cc=musl@lists.openwall.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).