From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/12587
Path: news.gmane.org!.POSTED!not-for-mail
From: Rich Felker <dalias@libc.org>
Newsgroups: gmane.linux.lib.musl.general
Subject: Efficient case mapping
Date: Tue, 6 Mar 2018 15:27:41 -0500
Message-ID: <20180306202741.GK1436@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: blaine.gmane.org 1520367962 32476 195.159.176.226 (6 Mar 2018 20:26:02 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Tue, 6 Mar 2018 20:26:02 +0000 (UTC)
User-Agent: Mutt/1.5.21 (2010-09-15)
To: musl@lists.openwall.com
Original-X-From: musl-return-12603-gllmg-musl=m.gmane.org@lists.openwall.com Tue Mar 06 21:25:58 2018
Return-path: <musl-return-12603-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.84_2)
	(envelope-from <musl-return-12603-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1etJ9i-00075l-Ht
	for gllmg-musl@m.gmane.org; Tue, 06 Mar 2018 21:25:50 +0100
Original-Received: (qmail 11491 invoked by uid 550); 6 Mar 2018 20:27:54 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 11455 invoked from network); 6 Mar 2018 20:27:54 -0000
Content-Disposition: inline
Original-Sender: Rich Felker <dalias@aerifal.cx>
Xref: news.gmane.org gmane.linux.lib.musl.general:12587
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/12587>

Replacing the compact but slow linear-search-based case mapping tables
with something comparably compact but fast has been on the hidden
roadmap for a while now, and I think I have a good solution:

Analysis of Unicode blocks (256-char) shows that most blocks have
either no case mappings at all, or just a single common delta between
lower and upper case for the whole block. There are about 11 blocks
that break this rule and have more. The most unique deltas with
nontrivial frequency (same delta used by more than a single-digit
number of characters) in any block is 3, appearing only in block 04.

As such, a single-direction case mapping can be achieved as follows.
Use two multi-level bit tables (same format as ctype bits) to assign a
2-bit number to each codepoint, and 2 integer deltas (the two most
common) to each block. The meanings are:

00 - no case mapping
01 - map via block's first delta
10 - map via block's second delta
11 - exception; binary search tiny table of exceptions

Obviously this can be extended to two directions via a second pair of
tables, but I think we can do better, just 3 bits total:

000 - no case mappings
001 - is lowercase; map to uppercase via block's first toupper delta
010 - is uppercase; map to lowercase via block's first tolower delta
011 - exception; binary search tiny table of exceptions
100 - is lowercase; map to uppercase via block's second toupper delta
101 - is lowercase; map to uppercase via block's third toupper delta
110 - is uppercase; map to lowercase via block's second tolower delta
111 - is uppercase; map to lowercase via block's third tolower delta

Most blocks will use no bits at all (no case mappings). A few blocks
that only contain lowercase with a single delta will use just one bit.
Many blocks with case mappings will use 2 bits (only one frequent
delta for each direction). The remaining few will use 3 bits, and now
they can code not just 2 but 3 frequent deltas in each direction,
getting rid of a lot of exceptions.

Some quick size estimates (hope I didn't mess this up):

11 complex blocks -- 11 blocks * 256/8 bytes * 3 bits
7 simple blocks -- 7 blocks * 256/8 bytes * 1 bit
Top-level tables -- 512 blocks * 1 byte * 3 tables
Exceptions -- ~96 pairs * 4 bytes * 2 sortings

Looks like it comes to about 3k total.

Further optimizing: We're not using all the data we already have. Only
alphabetic characters can have case mappings, and we already have an
iswalpha table; the above tables are highly correlated with it.

If we short-circuit out the whole operation with iswalpha, then the
"000 - no case mappings" value is no longer common. Instead, "zero is
one of the common deltas" can be included or not on a per-block basis.

We can also add, along with the per-block common deltas, a table of
which cases each delta goes with, and assign new bit meanings:

00 - is case[0] for block, map to ~case[0] via delta[0]
01 - is case[1] for block, map to ~case[1] via delta[1]
10 - is case[2] for block, map to ~case[2] via delta[2]
11 - is exception, map via binary search of exceptions table

If needed we could add back a third bit here, but I think we can get
by without it.

Note that, because iswalpha is now short-circuiting non-alphabetic
characters, we can just assign an arbitrary table value (best: 00) to
all non-alphabetic (including unassigned) characters. This drops most
of the "simple blocks" (blocks with just 0 or a single delta) down to
trivial (elided) blocks.

These last ideas ("further optimizing:" and beyond) were worked out
while writing this email, and I don't yet have any analysis/test code
to go with them. Will follow up later with results. With some luck,
final version will be barely-larger than current code, extensible, and
fast. :)

Rich