Re: Efficient case mapping

mailing list of musl libc
 help / color / mirror / code / Atom feed

From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: Efficient case mapping
Date: Wed, 7 Mar 2018 13:25:36 -0500	[thread overview]
Message-ID: <20180307182536.GO1436@brightrain.aerifal.cx> (raw)
In-Reply-To: <20180307034852.GL1436@brightrain.aerifal.cx>

[-- Attachment #1: Type: text/plain, Size: 4201 bytes --]

On Tue, Mar 06, 2018 at 10:48:52PM -0500, Rich Felker wrote:
> On Tue, Mar 06, 2018 at 03:27:41PM -0500, Rich Felker wrote:
> > Further optimizing: We're not using all the data we already have. Only
> > alphabetic characters can have case mappings, and we already have an
> > iswalpha table; the above tables are highly correlated with it.
> > 
> > If we short-circuit out the whole operation with iswalpha, then the
> > "000 - no case mappings" value is no longer common. Instead, "zero is
> > one of the common deltas" can be included or not on a per-block basis.
> > 
> > We can also add, along with the per-block common deltas, a table of
> > which cases each delta goes with, and assign new bit meanings:
> > 
> > 00 - is case[0] for block, map to ~case[0] via delta[0]
> > 01 - is case[1] for block, map to ~case[1] via delta[1]
> > 10 - is case[2] for block, map to ~case[2] via delta[2]
> > 11 - is exception, map via binary search of exceptions table
> > 
> > If needed we could add back a third bit here, but I think we can get
> > by without it.
> > 
> > Note that, because iswalpha is now short-circuiting non-alphabetic
> > characters, we can just assign an arbitrary table value (best: 00) to
> > all non-alphabetic (including unassigned) characters. This drops most
> > of the "simple blocks" (blocks with just 0 or a single delta) down to
> > trivial (elided) blocks.
> > 
> > These last ideas ("further optimizing:" and beyond) were worked out
> > while writing this email, and I don't yet have any analysis/test code
> > to go with them. Will follow up later with results. With some luck,
> > final version will be barely-larger than current code, extensible, and
> > fast. :)
> 
> I now have results for this approach which I'm attaching. The output
> is from my block analysis tool. For each block, it displays case
> mapping rules sorted by decreasing frequency, with characters that
> don't matter (because they're non-alphabetic) merged into the most
> frequent rule. The syntax is:
> 
>   rule d(t) [f]
> 
> where t is the character type (n = no case, u = uppercase, l =
> lowercase, x = exceptional/titlecase), d is the delta to the opposite
> case character, and f is the frequency of the rule (again, counting
> merging of dontcare into the most frequent).
> 
> While the vast majority of blocks are fine with 0-2 bits of rule data,
> a few do have more than 3 fairly-frequent rules, the worst being
> blocks 04, 05, and 2C. Having an extra bit would save at least 169
> exceptions in those, and thus ~676 bytes. The fixed cost of an extra
> table bit is 512 bytes (for the first level), plus 32 bytes per block
> that contains data, so it's pretty close to break-even, but keeping
> the exception table small does help performance too. My leaning would
> be to add the extra bit, but write a cost function so that it's only
> used where the 4th-7th frequencies for the block total more than 8
> (the number of exceptions to break even at 32 bytes).
> 
> With that approach it looks like we have about 220 entries in the
> exception tables, 4-5 blocks with a third bit, 21 blocks with a second
> bit, 24 blocks with a first bit, and 3 table headers. That comes to
> about 880+160+672+768+1536 = 4k.
> 
> One thing I haven't accounted for is storing the 3 (or 7) deltas per
> block. The raw deltas are signed 17-bit, but if we assume mappings
> don't cross planes they can be thought of as unsigned 16-bit on the
> low 16 bits of the codepoint. But that's still pretty large -- 3
> 16-bit numbers by 512 blocks is 3k. Fortunately I think we can do a
> lot better. First, have a table of all the deltas that ever appear
> (about 20). Then, for each block, have 0-7 single-byte indices into
> this table, and a single-byte index to the sequence of indices. Now we
> just have about 2*20+512+5*7+16*3+3*1 = 638 bytes.
> 
> 4.6k is a good bit more than I was planning on spending here, so maybe
> there are still improvements to be made...

And here is a draft of the code that would use the mapping. The
#ifdef'd external tables are to allow compiling the code without
actually having table data yet; otherwise gcc optimizes out the whole
translation unit. :-)

Comments?

Rich

[-- Attachment #2: casemap.c --]
[-- Type: text/plain, Size: 1803 bytes --]

#include <wchar.h>
#include <wctype.h>

#if 0
static const unsigned char tab0[] = {0};
static const unsigned char tab1[] = {0};
static const unsigned char tab2[] = {0};

static const unsigned char rulebases[512] = {0};
static const int rules[] = {0};

static const unsigned char exceptions[][2] = {0};
#else
extern const unsigned char tab0[];
extern const unsigned char tab1[];
extern const unsigned char tab2[];

extern const unsigned char rulebases[512];
extern const int rules[];

extern const unsigned char exceptions[][2];
#endif

static int casing_delta(unsigned c, int dir)
{
	unsigned b, x, y, v, rt, xb, xn;
	int r, rd;

	if (c >= 0x20000) return 0;

	b = c>>8;
	c &= 255;
	x = c>>3;
	y = c&7;
	v = (tab0[tab0[b]*32+x]>>y&1)
	  + (tab1[tab1[b]*32+x]>>y&1)*2
	  + (tab2[tab2[b]*32+x]>>y&1)*4;

	/* use the bit vector out of the tables as an index into
	 * a block-specific set of rules and decode the rule into
	 * a type and a case-mapping delta. */
	r = rules[rulebases[b]+v];
	rt = r & 255;
	rd = r >> 8;

	/* rules 0/1 are simple lower/upper case with a delta.
	 * apply according to desired mapping direction. */
	if (rt < 2) return rd & -(rt^dir);

	/* binary search. endpoints of the binary search for
	 * this block are stored in the rule delta field. */
	xn = rd & 0xff;
	xb = (unsigned)rd >> 8;
	while (xn) {
		unsigned try = exceptions[xb+xn/2][0];
		if (try == c) {
			r = rules[exceptions[xb+xn/2][1]];
			rt = r & 255;
			rd = r >> 8;
			if (rt < 2) return rd & -(rt^dir);
			/* Hard-coded for the four exceptional titlecase */
			return dir ? -1 : 1;
		} else if (try > c) {
			xn /= 2;
		} else {
			xb += xn/2;
			xn -= xn/2;
		}
	}
	return 0;
}

wint_t towlower(wint_t wc)
{
	return wc + casing_delta(wc, 0);
}

wint_t towupper(wint_t wc)
{
	return wc + casing_delta(wc, 1);
}

next prev parent reply	other threads:[~2018-03-07 18:25 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-06 20:27 Rich Felker
2018-03-07  3:48 ` Rich Felker
2018-03-07 18:25   ` Rich Felker [this message]
2018-04-05 17:09     ` Rich Felker
2018-03-07 15:44 ` Rich Felker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180307182536.GO1436@brightrain.aerifal.cx \
    --to=dalias@libc.org \
    --cc=musl@lists.openwall.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).