mailing list of musl libc
 help / color / mirror / code / Atom feed
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: Efficient case mapping
Date: Tue, 6 Mar 2018 22:48:52 -0500	[thread overview]
Message-ID: <20180307034852.GL1436@brightrain.aerifal.cx> (raw)
In-Reply-To: <20180306202741.GK1436@brightrain.aerifal.cx>

[-- Attachment #1: Type: text/plain, Size: 3761 bytes --]

On Tue, Mar 06, 2018 at 03:27:41PM -0500, Rich Felker wrote:
> Further optimizing: We're not using all the data we already have. Only
> alphabetic characters can have case mappings, and we already have an
> iswalpha table; the above tables are highly correlated with it.
> 
> If we short-circuit out the whole operation with iswalpha, then the
> "000 - no case mappings" value is no longer common. Instead, "zero is
> one of the common deltas" can be included or not on a per-block basis.
> 
> We can also add, along with the per-block common deltas, a table of
> which cases each delta goes with, and assign new bit meanings:
> 
> 00 - is case[0] for block, map to ~case[0] via delta[0]
> 01 - is case[1] for block, map to ~case[1] via delta[1]
> 10 - is case[2] for block, map to ~case[2] via delta[2]
> 11 - is exception, map via binary search of exceptions table
> 
> If needed we could add back a third bit here, but I think we can get
> by without it.
> 
> Note that, because iswalpha is now short-circuiting non-alphabetic
> characters, we can just assign an arbitrary table value (best: 00) to
> all non-alphabetic (including unassigned) characters. This drops most
> of the "simple blocks" (blocks with just 0 or a single delta) down to
> trivial (elided) blocks.
> 
> These last ideas ("further optimizing:" and beyond) were worked out
> while writing this email, and I don't yet have any analysis/test code
> to go with them. Will follow up later with results. With some luck,
> final version will be barely-larger than current code, extensible, and
> fast. :)

I now have results for this approach which I'm attaching. The output
is from my block analysis tool. For each block, it displays case
mapping rules sorted by decreasing frequency, with characters that
don't matter (because they're non-alphabetic) merged into the most
frequent rule. The syntax is:

  rule d(t) [f]

where t is the character type (n = no case, u = uppercase, l =
lowercase, x = exceptional/titlecase), d is the delta to the opposite
case character, and f is the frequency of the rule (again, counting
merging of dontcare into the most frequent).

While the vast majority of blocks are fine with 0-2 bits of rule data,
a few do have more than 3 fairly-frequent rules, the worst being
blocks 04, 05, and 2C. Having an extra bit would save at least 169
exceptions in those, and thus ~676 bytes. The fixed cost of an extra
table bit is 512 bytes (for the first level), plus 32 bytes per block
that contains data, so it's pretty close to break-even, but keeping
the exception table small does help performance too. My leaning would
be to add the extra bit, but write a cost function so that it's only
used where the 4th-7th frequencies for the block total more than 8
(the number of exceptions to break even at 32 bytes).

With that approach it looks like we have about 220 entries in the
exception tables, 4-5 blocks with a third bit, 21 blocks with a second
bit, 24 blocks with a first bit, and 3 table headers. That comes to
about 880+160+672+768+1536 = 4k.

One thing I haven't accounted for is storing the 3 (or 7) deltas per
block. The raw deltas are signed 17-bit, but if we assume mappings
don't cross planes they can be thought of as unsigned 16-bit on the
low 16 bits of the codepoint. But that's still pretty large -- 3
16-bit numbers by 512 blocks is 3k. Fortunately I think we can do a
lot better. First, have a table of all the deltas that ever appear
(about 20). Then, for each block, have 0-7 single-byte indices into
this table, and a single-byte index to the sequence of indices. Now we
just have about 2*20+512+5*7+16*3+3*1 = 638 bytes.

4.6k is a good bit more than I was planning on spending here, so maybe
there are still improvements to be made...

Rich

[-- Attachment #2: caseblocks1.txt --]
[-- Type: text/plain, Size: 4405 bytes --]

block 0:
  rule 32(u) [195]
  rule -32(l) [56]
  rule 0(n) [3]
  rule 7615(l) [1]
  rule 121(l) [1]
block 1:
  rule 1(u) [99]
  rule -1(l) [99]
  rule 0(n) [14]
  rule 2(u) [4]
  rule 0(x) [4]
  rule -2(l) [4]
  rule 218(u) [3]
  rule 205(u) [3]
  rule 217(u) [2]
  rule 211(u) [2]
  rule 219(u) [1]
  rule 214(u) [1]
  rule 213(u) [1]
  rule 210(u) [1]
  rule 209(u) [1]
  rule 207(u) [1]
  rule 206(u) [1]
  rule 203(u) [1]
  rule 202(u) [1]
  rule 195(l) [1]
  rule 163(l) [1]
  rule 130(l) [1]
  rule 97(l) [1]
  rule 79(u) [1]
  rule 56(l) [1]
  rule -56(u) [1]
  rule -79(l) [1]
  rule -97(u) [1]
  rule -121(u) [1]
  rule -199(u) [1]
  rule -232(l) [1]
block 2:
  rule 0(n) [147]
  rule 1(u) [32]
  rule -1(l) [32]
  rule -205(l) [3]
  rule -218(l) [3]
  rule 42308(l) [2]
  rule 10815(l) [2]
  rule -211(l) [2]
  rule -217(l) [2]
  rule 42319(l) [1]
  rule 42315(l) [1]
  rule 42305(l) [1]
  rule 42282(l) [1]
  rule 42280(l) [1]
  rule 42261(l) [1]
  rule 42258(l) [1]
  rule 10795(u) [1]
  rule 10792(u) [1]
  rule 10783(l) [1]
  rule 10782(l) [1]
  rule 10780(l) [1]
  rule 10749(l) [1]
  rule 10743(l) [1]
  rule 10727(l) [1]
  rule 71(u) [1]
  rule 69(u) [1]
  rule -69(l) [1]
  rule -71(l) [1]
  rule -130(u) [1]
  rule -163(u) [1]
  rule -195(u) [1]
  rule -202(l) [1]
  rule -203(l) [1]
  rule -206(l) [1]
  rule -207(l) [1]
  rule -209(l) [1]
  rule -210(l) [1]
  rule -213(l) [1]
  rule -214(l) [1]
  rule -219(l) [1]
block 3:
  rule 32(u) [152]
  rule -32(l) [26]
  rule 1(u) [17]
  rule -1(l) [17]
  rule 0(n) [8]
  rule 130(l) [3]
  rule 37(u) [3]
  rule -37(l) [3]
  rule -130(u) [3]
  rule 63(u) [2]
  rule -63(l) [2]
  rule 116(u) [1]
  rule 84(l) [1]
  rule 64(u) [1]
  rule 38(u) [1]
  rule 8(u) [1]
  rule 7(l) [1]
  rule -7(u) [1]
  rule -8(l) [1]
  rule -31(l) [1]
  rule -38(l) [1]
  rule -47(l) [1]
  rule -54(l) [1]
  rule -57(l) [1]
  rule -60(u) [1]
  rule -62(l) [1]
  rule -64(l) [1]
  rule -80(l) [1]
  rule -86(l) [1]
  rule -96(l) [1]
  rule -116(l) [1]
block 4:
  rule 1(u) [83]
  rule -1(l) [75]
  rule 32(u) [32]
  rule -32(l) [32]
  rule 80(u) [16]
  rule -80(l) [16]
  rule 15(u) [1]
  rule -15(l) [1]
block 5:
  rule 0(n) [132]
  rule 48(u) [38]
  rule -48(l) [38]
  rule 1(u) [24]
  rule -1(l) [24]
block 10:
  rule 0(n) [216]
  rule 7264(u) [40]
block 13:
  rule 0(n) [164]
  rule 38864(u) [80]
  rule 8(u) [6]
  rule -8(l) [6]
block 1c:
  rule 0(n) [247]
  rule -6242(l) [2]
  rule 35266(l) [1]
  rule -6181(l) [1]
  rule -6236(l) [1]
  rule -6243(l) [1]
  rule -6244(l) [1]
  rule -6253(l) [1]
  rule -6254(l) [1]
block 1d:
  rule 0(n) [254]
  rule 35332(l) [1]
  rule 3814(l) [1]
block 1e:
  rule 1(u) [123]
  rule -1(l) [123]
  rule 0(n) [8]
  rule -59(l) [1]
block 1f:
  rule 8(l) [116]
  rule -8(u) [78]
  rule 0(n) [25]
  rule 86(l) [4]
  rule -86(u) [4]
  rule 9(l) [3]
  rule -9(u) [3]
  rule 128(l) [2]
  rule 126(l) [2]
  rule 112(l) [2]
  rule 100(l) [2]
  rule 74(l) [2]
  rule -74(u) [2]
  rule -100(u) [2]
  rule -112(u) [2]
  rule -126(u) [2]
  rule -128(u) [2]
  rule 7(l) [1]
  rule -7(u) [1]
  rule -7205(l) [1]
block 21:
  rule 0(n) [217]
  rule 16(u) [16]
  rule -16(l) [16]
  rule 28(u) [1]
  rule 1(u) [1]
  rule -1(l) [1]
  rule -28(l) [1]
  rule -7517(u) [1]
  rule -8262(u) [1]
  rule -8383(u) [1]
block 24:
  rule 26(u) [230]
  rule -26(l) [26]
block 2c:
  rule 1(u) [82]
  rule -1(l) [59]
  rule 48(u) [47]
  rule -48(l) [47]
  rule 0(n) [10]
  rule -10815(u) [2]
  rule -3814(u) [1]
  rule -10727(u) [1]
  rule -10743(u) [1]
  rule -10749(u) [1]
  rule -10780(u) [1]
  rule -10782(u) [1]
  rule -10783(u) [1]
  rule -10792(l) [1]
  rule -10795(l) [1]
block 2d:
  rule 0(n) [216]
  rule -7264(l) [40]
block a6:
  rule 0(n) [182]
  rule 1(u) [37]
  rule -1(l) [37]
block a7:
  rule 1(u) [151]
  rule -1(l) [60]
  rule 0(n) [34]
  rule -42308(u) [2]
  rule 928(u) [1]
  rule -35332(u) [1]
  rule -42258(u) [1]
  rule -42261(u) [1]
  rule -42280(u) [1]
  rule -42282(u) [1]
  rule -42305(u) [1]
  rule -42315(u) [1]
  rule -42319(u) [1]
block ab:
  rule 0(n) [175]
  rule -38864(l) [80]
  rule -928(l) [1]
block ff:
  rule 0(n) [204]
  rule 32(u) [26]
  rule -32(l) [26]
block 104:
  rule 0(n) [104]
  rule 40(u) [76]
  rule -40(l) [76]
block 10c:
  rule 0(n) [154]
  rule 64(u) [51]
  rule -64(l) [51]
block 118:
  rule 32(u) [223]
  rule -32(l) [32]
  rule 0(n) [1]
block 1e9:
  rule 34(u) [221]
  rule -34(l) [34]
  rule 0(n) [1]

  reply	other threads:[~2018-03-07  3:48 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-06 20:27 Rich Felker
2018-03-07  3:48 ` Rich Felker [this message]
2018-03-07 18:25   ` Rich Felker
2018-04-05 17:09     ` Rich Felker
2018-03-07 15:44 ` Rich Felker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180307034852.GL1436@brightrain.aerifal.cx \
    --to=dalias@libc.org \
    --cc=musl@lists.openwall.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).